alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Blob backup failing #1946

Open dsj976 opened 2 weeks ago

dsj976 commented 2 weeks ago

:white_check_mark: Checklist

:computer: System information

:package: Packages

List of packages

```none
2024-07-02 12:20:02 [WARNING]: Powershell version: 7.4.3
2024-07-02 12:20:02 [WARNING]: The currently supported version of Powershell is 7.4.1.
2024-07-02 12:20:02 [WARNING]: In case of errors originating from Powershell code, ensure that you are running the currently supported version.
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.PrivateDns module version: 1.0.4
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Resources module version: 6.11.1
2024-07-02 12:20:02 [SUCCESS]: [✔] Poshstache module version: 0.1.10
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.MonitoringSolutions module version: 0.1.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Applications module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Compute module version: 6.3.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.DataProtection module version: 2.1.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.OperationalInsights module version: 3.2.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Dns module version: 1.1.3
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Users module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Authentication module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Network module version: 6.2.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.RecoveryServices module version: 6.6.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Accounts module version: 2.13.1
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Monitor module version: 4.6.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Powershell-Yaml module version: 0.4.2
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Storage module version: 5.10.1
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.KeyVault module version: 4.12.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Microsoft.Graph.Identity.DirectoryManagement module version: 1.21.0
2024-07-02 12:20:02 [SUCCESS]: [✔] Az.Automation module version: 1.9.1
```

:no_entry_sign: Describe the problem

The backup policy blobbackuppolicy that is created when deploying a TRE is creating a vaulted backup rule instead of an operational backup rule. This backup policy is associated with the bv-prod4-sre-<sre-id> backup vault. As a result the backup of the prod4<sre-id>data<random-letters> storage account, which contains three containers (ingress, egress and backup) fails with the error: "No containers selected for operation".

This is the JSON view of blobbackuppolicy:

```
{
    "properties": {
        "policyRules": [
            {
                "backupParameters": {
                    "backupType": "Discrete",
                    "objectType": "AzureBackupParams"
                },
                "trigger": {
                    "schedule": {
                        "repeatingTimeIntervals": [
                            "R/2023-03-26T13:00:00+00:00/P1W"
                        ],
                        "timeZone": "UTC"
                    },
                    "taggingCriteria": [
                        {
                            "tagInfo": {
                                "tagName": "Default",
                                "id": "Default_"
                            },
                            "taggingPriority": 99,
                            "isDefault": true
                        }
                    ],
                    "objectType": "ScheduleBasedTriggerContext"
                },
                "dataStore": {
                    "dataStoreType": "VaultStore",
                    "objectType": "DataStoreInfoBase"
                },
                "name": "BackupWeekly",
                "objectType": "AzureBackupRule"
            },
            {
                "lifecycles": [
                    {
                        "deleteAfter": {
                            "objectType": "AbsoluteDeleteOption",
                            "duration": "P12W"
                        },
                        "sourceDataStore": {
                            "dataStoreType": "VaultStore",
                            "objectType": "DataStoreInfoBase"
                        }
                    }
                ],
                "isDefault": true,
                "name": "Default",
                "objectType": "AzureRetentionRule"
            }
        ],
        "datasourceTypes": [
            "Microsoft.Storage/storageAccounts/blobServices"
        ],
        "objectType": "BackupPolicy"
    },
    "id": "/subscriptions/4aea9c2f-9b6c-42e8-8b09-3594994fe238/resourceGroups/RG_SHM_PROD4_SRE_SB123_BACKUP/providers/Microsoft.DataProtection/backupVaults/bv-prod4-sre-sb123/backupPolicies/blobbackuppolicy",
    "name": "blobbackuppolicy",
    "type": "Microsoft.DataProtection/backupVaults/backupPolicies"
}
```

:steam_locomotive: Workarounds or solutions

We manually created a backup policy on the Azure Portal with the following JSON view, which successfully worked when running an on-demand backup:

```
{
    "properties": {
        "policyRules": [
            {
                "lifecycles": [
                    {
                        "deleteAfter": {
                            "objectType": "AbsoluteDeleteOption",
                            "duration": "P30D"
                        },
                        "targetDataStoreCopySettings": [],
                        "sourceDataStore": {
                            "dataStoreType": "OperationalStore",
                            "objectType": "DataStoreInfoBase"
                        }
                    }
                ],
                "isDefault": true,
                "name": "Default",
                "objectType": "AzureRetentionRule"
            },
            {
                "lifecycles": [
                    {
                        "deleteAfter": {
                            "objectType": "AbsoluteDeleteOption",
                            "duration": "P7D"
                        },
                        "targetDataStoreCopySettings": [],
                        "sourceDataStore": {
                            "dataStoreType": "VaultStore",
                            "objectType": "DataStoreInfoBase"
                        }
                    }
                ],
                "isDefault": true,
                "name": "Default",
                "objectType": "AzureRetentionRule"
            },
            {
                "backupParameters": {
                    "backupType": "Discrete",
                    "objectType": "AzureBackupParams"
                },
                "trigger": {
                    "schedule": {
                        "repeatingTimeIntervals": [
                            "R/2024-06-18T18:30:00+00:00/P1D"
                        ],
                        "timeZone": "UTC"
                    },
                    "taggingCriteria": [
                        {
                            "tagInfo": {
                                "tagName": "Default",
                                "id": "Default_"
                            },
                            "taggingPriority": 99,
                            "isDefault": true
                        }
                    ],
                    "objectType": "ScheduleBasedTriggerContext"
                },
                "dataStore": {
                    "dataStoreType": "VaultStore",
                    "objectType": "DataStoreInfoBase"
                },
                "name": "BackupDaily",
                "objectType": "AzureBackupRule"
            }
        ],
        "datasourceTypes": [
            "Microsoft.Storage/storageAccounts/blobServices"
        ],
        "objectType": "BackupPolicy"
    },
    "id": "/subscriptions/4aea9c2f-9b6c-42e8-8b09-3594994fe238/resourceGroups/RG_SHM_PROD4_SRE_SB123_BACKUP/providers/Microsoft.DataProtection/backupVaults/bv-prod4-sre-sb123/backupPolicies/manualbackuppolicy",
    "name": "manualbackuppolicy",
    "type": "Microsoft.DataProtection/backupVaults/backupPolicies"
}
```

(Jim) that policy can then be applied to the existing backup instance, or used in a new backup instance.

JimMadge commented 2 weeks ago

Blob backup docs.

I think the question here is how did a policy rule for vaulted backup get created.

JimMadge commented 2 weeks ago

I think the important difference between the two JSON configurations is that the working configuration has an extra policyRule

                "lifecycles": [
                    {
                        "deleteAfter": {
                            "objectType": "AbsoluteDeleteOption",
                            "duration": "P30D"
                        },
                        "targetDataStoreCopySettings": [],
                        "sourceDataStore": {
                            "dataStoreType": "OperationalStore",
                            "objectType": "DataStoreInfoBase"
                        }
                    }
                ],
                "isDefault": true,
                "name": "Default",
                "objectType": "AzureRetentionRule"
            },

where the dataStoreType is OperationalStore.

Presumably that is the operational backup, which is working. Both configurations have a policyRule where dataStoreType is VaultStore which presumably is the backup which is not working.
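
The comparison above can be checked mechanically. The following is a minimal Python sketch, not part of the Data Safe Haven codebase: `summarise_policy` is a hypothetical helper, and the two sample dictionaries are trimmed copies of the policies quoted in this thread, keeping only the fields the helper reads. It reports which data store types a policy's rules refer to.

```python
def summarise_policy(policy: dict) -> dict:
    """Report whether a backup policy refers to the operational and/or vault store.

    Hypothetical helper: it only inspects the `dataStoreType` fields that
    appear in the JSON views quoted in this thread.
    """
    retention_stores = set()
    scheduled_stores = set()
    for rule in policy["properties"]["policyRules"]:
        if rule["objectType"] == "AzureRetentionRule":
            # Retention rules name their store under each lifecycle entry
            for lifecycle in rule["lifecycles"]:
                retention_stores.add(lifecycle["sourceDataStore"]["dataStoreType"])
        elif rule["objectType"] == "AzureBackupRule":
            # Scheduled backup rules name their store directly
            scheduled_stores.add(rule["dataStore"]["dataStoreType"])
    all_stores = retention_stores | scheduled_stores
    return {
        "operational": "OperationalStore" in all_stores,
        "vaulted": "VaultStore" in all_stores,
    }


def retention_rule(store: str, duration: str) -> dict:
    """Build a trimmed retention rule with the fields used above."""
    return {
        "lifecycles": [
            {
                "deleteAfter": {"objectType": "AbsoluteDeleteOption", "duration": duration},
                "sourceDataStore": {"dataStoreType": store, "objectType": "DataStoreInfoBase"},
            }
        ],
        "objectType": "AzureRetentionRule",
    }


def backup_rule(name: str) -> dict:
    """Build a trimmed scheduled backup rule targeting the vault store."""
    return {
        "dataStore": {"dataStoreType": "VaultStore", "objectType": "DataStoreInfoBase"},
        "name": name,
        "objectType": "AzureBackupRule",
    }


# Trimmed copy of the failing, automatically created policy
FAILING_POLICY = {
    "properties": {
        "policyRules": [backup_rule("BackupWeekly"), retention_rule("VaultStore", "P12W")]
    }
}

# Trimmed copy of the manually created, working policy
MANUAL_POLICY = {
    "properties": {
        "policyRules": [
            retention_rule("OperationalStore", "P30D"),
            retention_rule("VaultStore", "P7D"),
            backup_rule("BackupDaily"),
        ]
    }
}

if __name__ == "__main__":
    print("failing:", summarise_policy(FAILING_POLICY))
    print("manual: ", summarise_policy(MANUAL_POLICY))
```

With these inputs the failing policy is reported as vaulted only, while the manual policy refers to both stores, which matches the difference described above.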

dsj976 commented 2 weeks ago

During a meeting with @JimMadge and @craddm on 19/06/2024, we decided that it would be wise to manually delete blobbackuppolicy and its associated backup instance and create a new backup policy (e.g. called operationalblobbackuppolicy) with a single operational backup rule. This should be done for all production TREs to ensure that blob storage can be recovered in case of accidental deletion by a TRE user.

I have tested this in a sandbox TRE following these steps:

  1. Go to the backup vault bv-prod4-sre-<sre-id>, and on the menu on the left click Properties.
  2. Disable soft delete - this is necessary because otherwise Azure won't let you create a new backup instance
  3. On the menu on the left select Backup instances
  4. Click on the backup instance named prod4<sre-id>data<random-letters> associated with backup policy blobbackuppolicy
  5. Once inside, click Delete at the top and complete the fields to delete the backup instance
  6. If you disabled soft delete the backup instance should be completely erased, which will now allow you to create a new backup instance for prod4<sre-id>data<random-letters> with the new backup policy operationalblobbackuppolicy
  7. On the main page of the bv-prod4-sre-<sre-id> backup vault, click Backup policies on the menu on the left
  8. Click on Add to create a new backup policy and give it the name operationalblobbackuppolicy. Deselect Vaulted backup to ensure that the policy only has an operational backup rule. I left the retention rule at the default of 30 days.
  9. Now that you have deleted the old malfunctioning backup instance and have created a new operational backup policy, you can create a new backup instance for prod4<sre-id>data<random-letters> using operationalblobbackuppolicy
  10. On the main page of the bv-prod4-sre-<sre-id> backup vault, click Backup instances on the menu on the left
  11. Create a new backup instance by clicking Backup at the top of the page
  12. Leave the data source type as Azure Blobs
  13. On the next page select operationalblobbackuppolicy as your backup policy
  14. On the next page select the data source for this backup instance, which should be prod4<sre-id>data<random-letters>. Note that if you haven't completely deleted the old malfunctioning backup instance it won't let you complete this step
  15. Once Azure validates the parameters you can create the backup instance
  16. Blob recovery should now be available following this guide. I tested the recovery in a sandbox TRE and it worked successfully.

jemrobinson commented 1 week ago

This sounds like a great fix for TRESA - let's roll it out to production if we're sure it works. Ideally, we'd do this by scripting it - do any of you (@dsj976 @JimMadge @craddm) have time to work/cowork on fixing the current Setup_SRE_Backup.ps1 script so that it generates the correct type of policy?

dsj976 commented 1 week ago

@jemrobinson I am fixing this manually for production TREs. Can cowork on fixing Setup_SRE_Backup.ps1 once done with DSPT submission.

dsj976 commented 1 week ago

Important point: enabling operational backup on the data source (prod4<sre-id>data<random-letters> storage account) enforces the following changes in "Data protection":

jemrobinson commented 1 week ago

> @jemrobinson I am fixing this manually for production TREs. Can cowork on fixing Setup_SRE_Backup.ps1 once done with DSPT submission.

I agree that production needs to be fixed ASAP, but if there are more than a couple of TREs to fix - it will be quicker and more reliable to script these steps. How many TREs will you be patching?

dsj976 commented 1 week ago

> How many TREs will you be patching?

~Only two production TREs to fix. One done already.~ Update: only one production TRE had to have operational backup fixed. The other TRE had operational backup properly enabled.

jemrobinson commented 1 week ago

OK - are you able to perform any kind of test that the fix is working? Probably rolling-back data won't be appreciated, but maybe checking that the backups are running (initially triggered manually then later checking for the automated backup)?

dsj976 commented 1 week ago

I have done a test in a sandbox TRE where I deleted data while logged in to a VM as a non-privileged user. I was able to recover the deleted files by restoring to a previous point in time using the operational backup.

From what I understood yesterday (maybe @JimMadge can correct me if I am wrong), what we want for blob storage (the prod4<sre-id>data<random-letters> storage account with the ingress, egress and backup containers) is operational backup, not vaulted backup. Operational backups, unlike vaulted, are not triggered. They simply create a continuous time history that you can use for recovery. I have left this at the default value of 30 days (i.e. you can recover to any point within the last 30 days).

Operational backup doesn't create a copy of the data in a separate location (i.e. the backup vault), unlike vaulted backup. So it's not intended for data recovery due to hardware failure. Instead it's intended to mitigate accidental data deletion by a TRE user. For blob storage, we are covered against hardware failure by using Geo-redundant storage (GRS) in two separate regions within the UK (UK South and UK West).
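
To make the retention window concrete, here is a minimal Python sketch that turns a policy's `duration` value into the earliest point in time that could still be restored. It is an illustration only: `retention_to_timedelta` and `earliest_restore_point` are hypothetical names, the parser handles only the whole-week and whole-day ISO 8601 forms seen in this thread (`P30D`, `P7D`, `P12W`), and it assumes operational backup has been enabled for at least the full retention period.

```python
import re
from datetime import datetime, timedelta, timezone

# Only the duration forms that appear in the policies in this thread
_DURATION = re.compile(r"P(?:(?P<weeks>\d+)W|(?P<days>\d+)D)")


def retention_to_timedelta(duration: str) -> timedelta:
    """Convert an ISO 8601 duration such as 'P30D' or 'P12W' to a timedelta."""
    match = _DURATION.fullmatch(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    if match["weeks"] is not None:
        return timedelta(weeks=int(match["weeks"]))
    return timedelta(days=int(match["days"]))


def earliest_restore_point(now: datetime, duration: str) -> datetime:
    """Earliest restorable time, assuming backup was enabled the whole period."""
    return now - retention_to_timedelta(duration)


if __name__ == "__main__":
    now = datetime(2024, 7, 2, 12, 0, tzinfo=timezone.utc)
    for duration in ("P30D", "P12W"):
        print(duration, "->", earliest_restore_point(now, duration).isoformat())
```

Under these assumptions, the 30-day default covers any point in the preceding 30 days, while the `P12W` value in the automatically created policies corresponds to 84 days.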

jemrobinson commented 1 week ago

Is it possible to enable both operational and vaulted backup? I can see why they're both useful in different scenarios.

dsj976 commented 1 week ago

> Is it possible to enable both operational and vaulted backup? I can see why they're both useful in different scenarios.

Could be done, but I think the vaulted backup might fail as it was failing before. FYI, the Living with Machines project already had operational backup enabled and working properly, with no vaulted backup. I have only just noticed this. LWM was deployed in August 2023, so for some reason the backup policy that is automatically created at TRE deployment had an operational backup rule before (no vaulted backup rule), but now it's trying to enforce a vaulted backup instead (which fails). This is the JSON of the LWM backup policy (which was automatically created at TRE deployment):

```
{
    "properties": {
        "policyRules": [
            {
                "lifecycles": [
                    {
                        "deleteAfter": {
                            "objectType": "AbsoluteDeleteOption",
                            "duration": "P12W"
                        },
                        "sourceDataStore": {
                            "dataStoreType": "OperationalStore",
                            "objectType": "DataStoreInfoBase"
                        }
                    }
                ],
                "isDefault": true,
                "name": "Default",
                "objectType": "AzureRetentionRule"
            }
        ],
        "datasourceTypes": [
            "Microsoft.Storage/storageAccounts/blobServices"
        ],
        "objectType": "BackupPolicy"
    },
    "id": "/subscriptions/a974c413-b955-4c98-9be2-585a08d91927/resourceGroups/RG_SHM_PROD4_SRE_LWMP4_BACKUP/providers/Microsoft.DataProtection/backupVaults/bv-prod4-sre-lwmp4/backupPolicies/blobbackuppolicy",
    "name": "blobbackuppolicy",
    "type": "Microsoft.DataProtection/backupVaults/backupPolicies"
}
```

JimMadge commented 1 week ago

> Is it possible to enable both operational and vaulted backup? I can see why they're both useful in different scenarios.

Vaulted backup is currently in preview for blob containers, so there are restrictions on how/where you can use it. I'm fairly sure it didn't exist at all when I initially did the backup work.

jemrobinson commented 1 week ago

> Vaulted backup is currently in preview for blob containers, so there are restrictions on how/where you can use it. I'm fairly sure it didn't exist at all when I initially did the backup work.

I think this is exactly the problem. Your code uses the default behaviour and at some point MS have changed the default from operational to vaulted. Hence why older deployments are working as expected.

JimMadge commented 1 week ago

> I think this is exactly the problem. Your code uses the default behaviour and at some point MS have changed the default from operational to vaulted. Hence why older deployments are working as expected.

Seems that way, or the default is now operational plus vaulted as that is what seemed to happen when an instance was created in the portal. The code is just creating a vaulted instance though.

dsj976 commented 1 week ago

I have another question regarding backup. Looking at the RG_SHM_PROD4_SRE_<SRE-ID>_BACKUP resource group for newer TRE deployments, there are two types of resources:

For older TRE deployments, in the backup resource group there are additional snapshots for the following disks: EGRESS, SHARED, HOME, SCRATCH.

Does this mean that for newer deployments we are not doing point-in-time backups of the shared, scratch and home TRE directories?

jemrobinson commented 1 week ago

Yes, we can't back up shared or home as they're NFS Azure Files containers (which don't support backup). scratch is local to each VM and can be reset at any time without warning (e.g. each time the VM reboots).

dsj976 commented 1 week ago

Ok, in that case I think we need to make it clearer to TRE users that they need to be using the backup folder regularly. I can see it's written in the docs, but I don't think they are actively using it. I can see in the Azure portal that shared and home are in a storage account with zone-redundant storage replication (i.e. Azure keeps copies of the data in three different physical locations within the same region), but we can't provide point-in-time backups for these.

Also, I think this statement in the docs is wrong:

> If you are participating in a Turing Data Study Group, everything that is not stored in a GitLab repository or on the shared /shared/ or /output/ drives by Friday lunchtime will be DESTROYED FOR EVER.

I think what is currently happening is that once we run the tear down script, everything that is not in the output drive (which maps to the egress container, which lives under the SHM subscription) or the BACKUP resource group gets deleted for ever. The tear down script removes all resource groups under the TRE subscription except the BACKUP resource group (see #1823 - I thought this was a bug but maybe it was intentional). The GITLAB disk gets backed up in BACKUP, but if shared is not getting backed up then it's deleted once we run the tear down script.

JimMadge commented 1 week ago

@dsj976 Can you make a PR with your suggestions?

References to DSGs should have been removed from the DSH docs, so I think those must be things we missed.

My recollection is that backup should be removed in tear down. If you look at the scripts it should be clear what is intended. I don't think we should tell users data won't be removed in teardown because of a bug.

jemrobinson commented 1 day ago

N.B. This works without any changes to the code in Az.DataProtection == 0.4.0. Something must have changed between then and now.