Open dsj976 opened 2 weeks ago
Blob backup docs.
I think the question here is how did a policy rule for vaulted backup get created.
I think the important difference between the two JSON configurations is that the working configuration has an extra `policyRule`:
```json
"lifecycles": [
  {
    "deleteAfter": {
      "objectType": "AbsoluteDeleteOption",
      "duration": "P30D"
    },
    "targetDataStoreCopySettings": [],
    "sourceDataStore": {
      "dataStoreType": "OperationalStore",
      "objectType": "DataStoreInfoBase"
    }
  }
],
"isDefault": true,
"name": "Default",
"objectType": "AzureRetentionRule"
},
```
where the `dataStoreType` is `OperationalStore`.
Presumably that is the operational backup, which is working.
Both configurations have a `policyRule` where `dataStoreType` is `VaultStore`, which presumably is the backup that is not working.
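
The comparison above can be checked mechanically once the rules from a policy's JSON view are loaded into Python. This is only a minimal sketch: the function names and the `VAULTED_STUB` rule are hypothetical, and the only keys relied on (`objectType`, `lifecycles`, `sourceDataStore`, `dataStoreType`) are the ones visible in the fragment quoted above.

```python
# Operational retention rule, taken from the fragment quoted in this thread.
OPERATIONAL_RULE = {
    "lifecycles": [
        {
            "deleteAfter": {"objectType": "AbsoluteDeleteOption", "duration": "P30D"},
            "targetDataStoreCopySettings": [],
            "sourceDataStore": {
                "dataStoreType": "OperationalStore",
                "objectType": "DataStoreInfoBase",
            },
        }
    ],
    "isDefault": True,
    "name": "Default",
    "objectType": "AzureRetentionRule",
}

# Hypothetical stub standing in for the vaulted rule; NOT the real policy JSON.
VAULTED_STUB = {
    "lifecycles": [
        {"sourceDataStore": {"dataStoreType": "VaultStore", "objectType": "DataStoreInfoBase"}}
    ],
    "name": "VaultedStub",
    "objectType": "AzureRetentionRule",
}


def data_store_types(policy_rules):
    """Collect the dataStoreType of every lifecycle in the retention rules."""
    found = set()
    for rule in policy_rules:
        if rule.get("objectType") != "AzureRetentionRule":
            continue  # skip anything that is not a retention rule
        for lifecycle in rule.get("lifecycles", []):
            store_type = lifecycle.get("sourceDataStore", {}).get("dataStoreType")
            if store_type is not None:
                found.add(store_type)
    return found


def classify_policy(policy_rules):
    """Label a policy by the data stores its retention rules use."""
    stores = data_store_types(policy_rules)
    operational = "OperationalStore" in stores
    vaulted = "VaultStore" in stores
    if operational and vaulted:
        return "operational+vaulted"
    if operational:
        return "operational"
    if vaulted:
        return "vaulted"
    return "unknown"


if __name__ == "__main__":
    print(classify_policy([VAULTED_STUB]))
    print(classify_policy([OPERATIONAL_RULE, VAULTED_STUB]))
```

With a check like this, a policy that classifies as `vaulted` only would match the non-working configuration described here, while one that includes an operational rule would match the working one.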
During a meeting with @JimMadge and @craddm on 19/06/2024, we decided that it would be wise to manually delete `blobbackuppolicy` and its associated backup instance and create a new backup policy (e.g. called `operationalblobbackuppolicy`) with a single operational backup rule. This should be done for all production TREs to ensure that blob storage can be recovered in case of accidental deletion by a TRE user.
I have tested this in a sandbox TRE following these steps:

1. Go to `bv-prod4-sre-<sre-id>`, and on the menu on the left click Properties.
2. Delete the backup instance `prod4<sre-id>data<random-letters>` associated with backup policy `blobbackuppolicy`
3. `prod4<sre-id>data<random-letters>` with the new backup policy `operationalblobbackuppolicy`
4. In the `bv-prod4-sre-<sre-id>` backup vault, click Backup policies on the menu on the left
5. Create a policy called `operationalblobbackuppolicy`. Deselect Vaulted backup to ensure that the policy only has an operational backup rule. I left the retention rule at the default of 30 days.
6. `prod4<sre-id>data<random-letters>` using `operationalblobbackuppolicy`
7. In the `bv-prod4-sre-<sre-id>` backup vault, click Backup instances on the menu on the left
8. Select `operationalblobbackuppolicy` as your backup policy
9. `prod4<sre-id>data<random-letters>`. Note that if you haven't completely deleted the old malfunctioning backup instance it won't let you complete this step

This sounds like a great fix for TRESA - let's roll it out to production if we're sure it works. Ideally, we'd do this by scripting it - do any of you (@dsj976 @JimMadge @craddm) have time to work/cowork on fixing the current `Setup_SRE_Backup.ps1` script so that it generates the correct type of policy?
@jemrobinson I am fixing this manually for production TREs. Can cowork on fixing `Setup_SRE_Backup.ps1` once done with DSPT submission.
Important point: enabling operational backup on the data source (`prod4<sre-id>data<random-letters>` storage account) enforces the following changes in "Data protection":
> @jemrobinson I am fixing this manually for production TREs. Can cowork on fixing `Setup_SRE_Backup.ps1` once done with DSPT submission.

I agree that production needs to be fixed ASAP, but if there are more than a couple of TREs to fix - it will be quicker and more reliable to script these steps. How many TREs will you be patching?
> How many TREs will you be patching?

~Only two production TREs to fix. One done already.~ Update: only one production TRE had to have operational backup fixed. The other TRE had operational backup properly enabled.
OK - are you able to perform any kind of test that the fix is working? Probably rolling-back data won't be appreciated, but maybe checking that the backups are running (initially triggered manually then later checking for the automated backup)?
I have done a test in a sandbox TRE where I deleted data while logged in to a VM as a non-privileged user. I was able to recover the deleted files by restoring to a previous point in time using the operational backup.
From what I understood yesterday (maybe @JimMadge can correct me if I am wrong), what we want for blob storage (the `prod4<sre-id>data<random-letters>` storage account with the `ingress`, `egress` and `backup` containers) is operational backup, not vaulted backup. Operational backups, unlike vaulted, are not triggered. They simply create a continuous time history that you can use for recovery. I have left this at the default value of 30 days (i.e. you can recover to any point within the last 30 days).
Operational backup doesn't create a copy of the data in a separate location (i.e. the backup vault), unlike vaulted backup. So it's not intended for data recovery due to hardware failure. Instead it's intended to mitigate accidental data deletion by a TRE user. For blob storage, we are covered against hardware failure by using Geo-redundant storage (GRS) in two separate regions within the UK (UK South and UK West).
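
To make the 30-day window concrete, here is a rough sketch of how the retention duration from the policy JSON (`"duration": "P30D"`) translates into the earliest point you could restore to. The helper names are hypothetical, and the parser deliberately handles only the simple `P<n>D` form seen in this thread, not ISO 8601 durations in general.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_day_duration(duration):
    """Parse a day-only duration such as 'P30D' into a timedelta."""
    match = re.fullmatch(r"P(\d+)D", duration)
    if match is None:
        raise ValueError(f"unsupported duration format: {duration!r}")
    return timedelta(days=int(match.group(1)))


def earliest_restore_point(now, duration="P30D"):
    """Oldest point in time still covered by the retention window."""
    return now - parse_day_duration(duration)


def can_restore_to(target, now, duration="P30D"):
    """True if `target` falls inside the continuous restore window."""
    return earliest_restore_point(now, duration) <= target <= now


if __name__ == "__main__":
    now = datetime(2024, 6, 20, tzinfo=timezone.utc)
    print(earliest_restore_point(now))
    print(can_restore_to(datetime(2024, 6, 1, tzinfo=timezone.utc), now))
```

This only models the retention arithmetic; whether a given restore actually succeeds still depends on the storage account's data protection settings.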
Is it possible to enable both operational and vaulted backup? I can see why they're both useful in different scenarios.
> Is it possible to enable both operational and vaulted backup? I can see why they're both useful in different scenarios.

It could be done, but I think the vaulted backup might fail as it was failing before. FYI, the Living with Machines project already had operational backup enabled and working properly, with no vaulted backup. I only noticed this now. LWM was deployed in August 2023, so for some reason the backup policy that is automatically created at TRE deployment used to have an operational backup rule (and no vaulted backup rule), but now it's trying to enforce a vaulted backup instead (which fails). This is the JSON of the LWM backup policy (which was automatically created at TRE deployment):
> Is it possible to enable both operational and vaulted backup? I can see why they're both useful in different scenarios.

Vaulted backup is currently in preview for blob containers, so there are restrictions on how/where you can use it. I'm fairly sure it didn't exist at all when I initially did the backup work.
> Vaulted backup is currently in preview for blob containers, so there are restrictions on how/where you can use it. I'm fairly sure it didn't exist at all when I initially did the backup work.

I think this is exactly the problem. Your code uses the default behaviour and at some point MS have changed the default from operational to vaulted. Hence why older deployments are working as expected.
> I think this is exactly the problem. Your code uses the default behaviour and at some point MS have changed the default from operational to vaulted. Hence why older deployments are working as expected.

Seems that way, or the default is now operational plus vaulted as that is what seemed to happen when an instance was created in the portal. The code is just creating a vaulted instance though.
I have another question regarding backup. Looking at the `RG_SHM_PROD4_SRE_<SRE-ID>_BACKUP` resource group for newer TRE deployments, there are two types of resources:

- `GITLAB` and `CODIMD` disks only.
- `ingress`, `egress` and `backup` containers.

For older TRE deployments, in the backup resource group there are additional snapshots for the following disks: `EGRESS`, `SHARED`, `HOME`, `SCRATCH`.
Does this mean that for newer deployments we are not doing point-in-time backups of the `shared`, `scratch` and `home` TRE directories?
Yes, we can't back up `shared` or `home` as they're NFS Azure Files containers (which don't support backup). `scratch` is local to each VM and can be reset at any time without warning (e.g. each time the VM reboots).
Ok, in that case I think we need to make it clearer to TRE users that they need to be using the `backup` folder regularly. I can see it's written in the docs, but I don't think they are actively using it. I can see in the Azure portal that `shared` and `home` are in a storage account with zone-redundant storage replication (i.e. Azure backs up the data in three different physical locations within the same region), but we can't provide point-in-time backups for these.
Also, I think this statement in the docs is wrong:
> If you are participating in a Turing Data Study Group, everything that is not stored in a GitLab repository or on the shared /shared/ or /output/ drives by Friday lunchtime will be DESTROYED FOR EVER.

I think what is currently happening is that once we run the tear down script, everything that is not in the `output` drive (which maps to the `egress` container, which lives under the SHM subscription) or the `BACKUP` resource group gets deleted for ever. The tear down script removes all resource groups under the TRE subscription except the `BACKUP` resource group (see #1823 - I thought this was a bug but maybe it was intentional). The `GITLAB` disk gets backed up in `BACKUP`, but if `shared` is not getting backed up then it's deleted once we run the tear down script.
@dsj976 Can you make a PR with your suggestions?
References to DSGs should have been removed from the DSH docs, so I think those must be things we missed.
My recollection is that backup should be removed in tear down. If you look at the scripts it should be clear what is intended. I don't think we should tell users data won't be removed in teardown because of a bug.
N.B. This works without any changes to the code in `Az.DataProtection == 0.4.0`. Something must have changed between then and now.
:white_check_mark: Checklist
:computer: System information
:package: Packages
List of packages
```none
2024-07-02 12:20:02 [WARNING]: Powershell version: 7.4.3
2024-07-02 12:20:02 [WARNING]: The currently supported version of Powershell is 7.4.1.
2024-07-02 12:20:02 [WARNING]: In case of errors originating from Powershell code, ensure that you are running the currently supported version.
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.PrivateDns module version: 1.0.4
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Resources module version: 6.11.1
2024-07-02 12:20:02 [SUCCESS]: [✔] Poshstache module version: 0.1.10
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.MonitoringSolutions module version: 0.1.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Applications module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Compute module version: 6.3.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.DataProtection module version: 2.1.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.OperationalInsights module version: 3.2.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Dns module version: 1.1.3
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Users module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Authentication module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Network module version: 6.2.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.RecoveryServices module version: 6.6.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Accounts module version: 2.13.1
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Monitor module version: 4.6.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Powershell-Yaml module version: 0.4.2
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Storage module version: 5.10.1
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.KeyVault module version: 4.12.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Identity.DirectoryManagement module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Automation module version: 1.9.1
```
:no_entry_sign: Describe the problem
The backup policy `blobbackuppolicy` that is created when deploying a TRE is created with a vaulted backup rule instead of an operational backup rule. This backup policy is associated with the `bv-prod4-sre-<sre-id>` backup vault. As a result, the backup of the `prod4<sre-id>data<random-letters>` storage account, which contains three containers (`ingress`, `egress` and `backup`), fails with the error: "No containers selected for operation".
This is the JSON view of `blobbackuppolicy`:
:steam_locomotive: Workarounds or solutions
We manually created a backup policy on the Azure Portal with the following JSON view, which successfully worked when running an on-demand backup:
(Jim) that policy can then be applied to the existing backup instance, or used in a new backup instance.