alan-turing-institute / data-safe-haven-team

Project board for the Data Safe Havens in the Cloud team
BSD 3-Clause "New" or "Revised" License

Temporary documentation for update management #20

Open edwardchalstrey1 opened 1 year ago

edwardchalstrey1 commented 1 year ago

Step 1

Check updates: https://learn.microsoft.com/en-us/azure/update-center/quickstart-on-demand#check-updates

Works for some but not all

I selected the prod4 and edon subscriptions. The ones that failed the assessment were the compute VMs for sandbox (in the prod4 subscription) and edon.

Screenshot 2023-03-27 at 16 09 40

Step 2

For those it will let us, install one-time updates:

Screenshot 2023-03-27 at 16 15 49

Works for some but not all

It refuses to do it for the VMs identified above, not surprising:

Screenshot 2023-03-27 at 16 17 57

Can we change the update settings for the affected VMs?

See docs that were linked by the above error: https://learn.microsoft.com/en-gb/azure/update-center/manage-update-settings?tabs=manage-single-overview%2Cmanage-scale-overview#configure-settings-on-single-vm

Screenshot 2023-03-27 at 16 24 36 Screenshot 2023-03-27 at 16 25 12

Answer seems to be no, error msg suggests it can't be done for VMs made with this particular image

Step 3

For the VMs where the updates are allowed, monitor the progress:

Screenshot 2023-03-27 at 16 52 27

Step 4

Although it definitely won't let us do a manual update (or even assessment) for the compute VMs, one solution could be to just deploy a new SRD (which presumably will have the most recent Linux patches) until we have fixed https://github.com/alan-turing-institute/data-safe-haven/issues/1403

Screenshot 2023-03-27 at 16 57 34

However, a new SRD is only as up to date as the image it is deployed from (see the discussion below), so instead log into the serial console of the compute VM and run:

sudo apt update && sudo apt upgrade -y
sudo apt --fix-broken install -y
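
The two commands above can be wrapped in a small script so that the repair step only runs when the upgrade fails. This is a minimal sketch, not a tested procedure: the function name `upgrade_with_repair` and the `APT` override variable are hypothetical, introduced here for illustration, and `apt-get` is used in place of `apt` because it is the interface intended for scripts.

```shell
#!/bin/sh
# Sketch: upgrade packages, repairing broken dependencies if the upgrade fails.
# APT is a hypothetical override so the function can be dry-run with a stub.
APT="${APT:-sudo apt-get}"

upgrade_with_repair() {
    # Refresh the package lists; give up early if that fails.
    $APT update || return 1

    # First attempt at the upgrade.
    if $APT upgrade -y; then
        return 0
    fi

    # The upgrade failed (e.g. unmet dependencies): repair, then retry once.
    $APT --fix-broken install -y || return 1
    $APT upgrade -y
}
```

Calling `upgrade_with_repair` from the serial console as the admin user would run the same steps as above, and its exit status says whether the final upgrade succeeded.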

edwardchalstrey1 commented 1 year ago

@JimMadge @craddm I have found a temporary solution to update management that I think should be acceptable for the time being. For most of the Linux/Windows VMs that make up both the SHM and SREs, we are able to update via "Update management center", which is a different thing in Azure from the "Update management" section of the shm-prod4-automation Automation Account we have set up.

The VMs that don't get updated this way are the compute VMs, apparently because of the image they use (see above), so my proposed temporary solution for these is to deploy a new SRD for the long-running SREs (e.g. edon). I assume that the new SRD will have the most recent Linux updates, do you agree?

JimMadge commented 1 year ago

from here

Automatic VM guest patching, on-demand patch assessment and on-demand patch installation are supported only on VMs created from images with the exact combination of publisher, offer and sku from the below supported OS images list. Custom images or any other publisher, offer, sku combinations aren't supported. More images are added periodically.

If that is true I don't think this way of updating machines can work for us at all. It isn't feasible to build every SRD from base Ubuntu @jemrobinson. Did this not work before?

This is just another thing pushing me towards Ansible AWX...
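
One way to check whether a given VM falls under the restriction quoted above is to look at its image reference. This is a hedged sketch using the Azure CLI: `az vm show` with `--query` and `--output tsv` are real options, but the function name `image_kind` is hypothetical, and treating both an empty string and `None` as "no publisher" is an assumption about how the CLI prints a null value.

```shell
#!/bin/sh
# Sketch: report whether a VM was created from a marketplace image
# (publisher set) or from a custom image (publisher expected to be null).
# image_kind is a hypothetical helper name.
image_kind() {
    rg="$1"
    vm="$2"
    # Ask only for the publisher field of the image reference.
    publisher="$(az vm show --resource-group "$rg" --name "$vm" \
        --query "storageProfile.imageReference.publisher" --output tsv)" || return 1

    case "$publisher" in
        ""|None) echo "custom" ;;
        *)       echo "marketplace:$publisher" ;;
    esac
}
```

It would be called as `image_kind <resource-group> <vm-name>`. A `custom` result would suggest the VM is outside the supported list even if the image was originally built from a supported Ubuntu release.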

JimMadge commented 1 year ago

The VMs that don't get updated this way are the compute VMs, apparently because of the image they use (see above), so my proposed temporary solution for these is to deploy a new SRD for the long-running SREs (e.g. edon). I assume that the new SRD will have the most recent Linux updates, do you agree?

That is going to take a while and mean downtime and/or moving users to a new SRD. How often do updates need to be applied for DSPT?

I suspect it would be quicker and easier to run the equivalent commands on the SRDs as the admin user. That isn't ideal, but I don't think there is a reason you shouldn't do that. We would rely on 'safe people' there to have confidence you won't access any sensitive data.

edwardchalstrey1 commented 1 year ago

If that is true I don't think this way of updating machines can work for us at all. It isn't feasible to build every SRD from base Ubuntu @jemrobinson. Did this not work before?

Sure, I don't propose abandoning the current way of doing things though, I'm seeing this as a temporary fix for those it does work for whilst we don't have a solution for https://github.com/alan-turing-institute/data-safe-haven/issues/1403

That is going to take a while and mean downtime and/or moving users to a new SRD.

It's pretty quick and won't result in any downtime, the new SRD won't have any changes beyond apps being closed so moving over to it shouldn't be an issue - it's accessed in the exact same way

How often do updates need to be applied for DSPT?

I don't know, who knows this? @harisood ?

I suspect it would be quicker and easier to run the equivalent commands on the SRDs as the admin user.

Happy to do this, but what are they? Bear in mind this is general updates for all linux/windows VMs

JimMadge commented 1 year ago

If that is true I don't think this way of updating machines can work for us at all. It isn't feasible to build every SRD from base Ubuntu @jemrobinson. Did this not work before?

Sure, I don't propose abandoning the current way of doing things though, I'm seeing this as a temporary fix for those it does work for whilst we don't have a solution for alan-turing-institute/data-safe-haven#1403

But that was the error message from the current way of handling updates, no? It clearly says that it will not work for custom images.

That is going to take a while and mean downtime and/or moving users to a new SRD.

It's pretty quick and won't result in any downtime, the new SRD won't have any changes beyond apps being closed so moving over to it shouldn't be an issue - it's accessed in the exact same way

But you will either have to kick users off and kill jobs on the 'old' SRD, or shut it down before deploying a new one. Either way it is much more disruptive than updating the packages in situ.

Also I think a new SRD will only be as up to date as the VM image that is deployed. So unless you build a new image each time as well, the new SRD won't have newer packages.

I suspect it would be quicker and easier to run the equivalent commands on the SRDs as the admin user.

Happy to do this, but what are they? Bear in mind this is general updates for all linux/windows VMs

Isn't it only the SRDs that need an alternative way to be updated? apt update && apt upgrade -y
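
Whether a freshly deployed SRD is behind on packages can be checked directly rather than assumed. A minimal sketch that counts pending upgrades from the output of `apt list --upgradable`: the function name `count_upgradable` is hypothetical, and matching on the `upgradable from:` marker is an assumption about apt's human-readable output, which apt itself warns is not a stable interface.

```shell
#!/bin/sh
# Sketch: count packages with a pending upgrade.
# Reads `apt list --upgradable` output on stdin so it can be tried offline.
count_upgradable() {
    # Each upgradable package line contains "[upgradable from: <version>]";
    # the "Listing..." header does not match. grep -c exits non-zero when the
    # count is 0, so "|| true" keeps the function's status at 0.
    grep -c 'upgradable from:' || true
}
```

On an SRD this would be used as `apt update && apt list --upgradable 2>/dev/null | count_upgradable`. A non-zero count on a newly deployed SRD would confirm that the image, not the deployment date, determines how current the packages are.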

edwardchalstrey1 commented 1 year ago

Also I think a new SRD will only be as up to date as the VM image that is deployed. So unless you build a new image each time as well, the new SRD won't have newer packages.

Ah ok, if this is the case then I agree. (By the way, it's possible to deploy multiple SRDs (compute VMs) per SRE, so it wouldn't have resulted in downtime, but maybe you're right about people running jobs.)

Isn't it only the SRDs that need an alternative way to be updated?

Yes you're right, ok if you think apt update && apt upgrade -y is sufficient I'll run that on the SRD VMs in question

edwardchalstrey1 commented 1 year ago

Also had to run:

sudo apt --fix-broken install -y

after

sudo apt update && sudo apt upgrade -y

because

The following packages have unmet dependencies:
 nvidia-dkms-525 : Depends: nvidia-kernel-common-525 (<= 525.78.01-1) but it is not installed
                   Depends: nvidia-kernel-common-525 (>= 525.78.01) but it is not installed
 nvidia-driver-525 : Depends: nvidia-kernel-common-525 (<= 525.78.01-1) but it is not installed
                     Depends: nvidia-kernel-common-525 (>= 525.78.01) but it is not installed
                     Recommends: libnvidia-compute-525:i386 (= 525.78.01-0ubuntu0.20.04.1)
                     Recommends: libnvidia-decode-525:i386 (= 525.78.01-0ubuntu0.20.04.1)
                     Recommends: libnvidia-encode-525:i386 (= 525.78.01-0ubuntu0.20.04.1)
                     Recommends: libnvidia-fbc1-525:i386 (= 525.78.01-0ubuntu0.20.04.1)
                     Recommends: libnvidia-gl-525:i386 (= 525.78.01-0ubuntu0.20.04.1)
 nvidia-kernel-common-520 : Depends: nvidia-kernel-common-525 but it is not installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
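
To see at a glance which packages are affected, the names can be pulled out of apt's error text. A small sketch, assuming the layout shown above (the package name followed by ` : ` at the start of each entry); `broken_packages` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list packages that apt reports as having unmet dependencies.
# Reads apt's error output on stdin.
broken_packages() {
    # Entry lines look like " name : Depends: ..."; continuation lines start
    # with "Depends:" or "Recommends:" and so have no " : " after the first word.
    awk '/^ *[^ :]+ : /{ print $1 }'
}
```

Fed the output above, this would print `nvidia-dkms-525`, `nvidia-driver-525` and `nvidia-kernel-common-520`, one per line.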

jemrobinson commented 1 year ago

A couple of things here.

  1. Please don't mix the "Update management" option of Automation Account with the "Update management center". We're using the first one and although the second one is newer it still doesn't have a stable public release.
  2. Please don't make changes to a production system that haven't been tested and verified on a dev system. This is why we have separate dev systems in the first place.

Addressing some individual points below:

Answer seems to be no, error msg suggests it can't be done for VMs made with this particular image

Ubuntu 20.04 is one of the supported OSes at that link. That's what the SRD image you're using (20-04-2022112900) is based off.

Custom images or any other publisher, offer, sku combinations aren't supported.

If that is true I don't think this way of updating machines can work for us at all. It isn't feasible to build every SRD from base Ubuntu @jemrobinson. Did this not work before?

This used to work. The base image is a supported image (and if you look at a deployed VM in the portal Ubuntu 20.04 is still listed as the image name). Has it definitely stopped working? Have we confirmed this on a new deployment?

Yes you're right, ok if you think apt update && apt upgrade -y is sufficient I'll run that on the SRD VMs in question

The Automation Account update management is basically just running apt update, but managed by the portal rather than e.g. a cron job on the machine.

Check updates: https://learn.microsoft.com/en-us/azure/update-center/quickstart-on-demand#check-updates

This is for the "Update management center" which is not the solution we're trying to use here. If you look at the Automation Account, you can see that the problem is that the Automation Account doesn't see any Linux VMs as being registered with the Log Analytics workspace. It's not trying-and-failing to install updates, it isn't even seeing the VMs that updates need to be installed on. My guess is that there might be a network rule that's preventing this communication.
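
The network-rule guess can be tested from one of the affected VMs. A hedged sketch: the two workspace-specific hostnames follow the pattern that, as far as I know, is documented for the Log Analytics agent, but they should be checked against the current Microsoft firewall requirements. The function names are hypothetical, the workspace ID is a placeholder to be filled in, and `nc` is assumed to be installed.

```shell
#!/bin/sh
# Sketch: check outbound HTTPS from a VM to its Log Analytics workspace.

# Print the workspace-specific endpoints (assumed hostname pattern).
workspace_endpoints() {
    workspace_id="$1"
    echo "${workspace_id}.ods.opinsights.azure.com"
    echo "${workspace_id}.oms.opinsights.azure.com"
}

# Try a TCP connection to port 443 on each endpoint and report the result.
# A failure may mean a blocked connection or a DNS lookup that did not resolve.
check_workspace_connectivity() {
    for host in $(workspace_endpoints "$1"); do
        if nc -z -w 5 "$host" 443 2>/dev/null; then
            echo "ok     $host"
        else
            echo "FAILED $host"
        fi
    done
}
```

Running `check_workspace_connectivity <workspace-id>` on an SRD and on a VM that does appear in the Automation Account would show whether the two differ in what they can reach.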