alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Update management is not running for VMs of recently deployed SREs (except Guacamole) #1403

Closed edwardchalstrey1 closed 8 months ago

edwardchalstrey1 commented 1 year ago

:white_check_mark: Checklist

:computer: System information

:cactus: Powershell module versions

2023-02-24 09:58:58 [SUCCESS]: [✔] Powershell version: 7.2.6
2023-02-24 09:58:58 [SUCCESS]: [✔] Poshstache module version: 0.1.10
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.KeyVault module version: 4.6.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Accounts module version: 2.9.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Microsoft.Graph.Identity.DirectoryManagement module version: 1.10.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Microsoft.Graph.Authentication module version: 1.17.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Storage module version: 4.7.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Network module version: 4.18.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.PrivateDns module version: 1.0.3
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.RecoveryServices module version: 5.4.1
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Compute module version: 4.29.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Resources module version: 6.0.1
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Automation module version: 1.7.3
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Dns module version: 1.1.2
2023-02-24 09:58:58 [SUCCESS]: [✔] Powershell-Yaml module version: 0.4.2
2023-02-24 09:58:58 [SUCCESS]: [✔] Microsoft.Graph.Applications module version: 1.17.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.MonitoringSolutions module version: 0.1.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.OperationalInsights module version: 3.1.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Monitor module version: 3.0.1
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.DataProtection module version: 0.4.0

:no_entry_sign: Describe the problem

For recently deployed SREs (e.g. dsgkeep), Update Management is only working for Guacamole and not for the other VMs, including the compute VM. As a result, packages are not being updated; for example, this caused the problem reported in #1401.

Screenshot 2023-02-24 at 09 59 57

This is strange because it looks like the VM is connected to Log Analytics:

Screenshot 2023-02-24 at 10 40 45

:steam_locomotive: Workarounds or solutions

jemrobinson commented 1 year ago

Can you change the scheduled time for sre-sbox2-linux-updates from the current value (every 7 days, starting next Tuesday at 02:02) to every 1 hour, starting at 03/03/2023 15:00? That way it will run in 15 minutes' time and we can see which VMs it runs over.

edwardchalstrey1 commented 1 year ago

Hmm, it didn't seem to run, will investigate. Oh, wrong date; it should run at 4pm now.

edwardchalstrey1 commented 1 year ago

Weird, it said it was going to run at 4 but didn't.

Screenshot 2023-03-03 at 16 04 41

Screenshot 2023-03-03 at 16 04 33

jemrobinson commented 1 year ago

Maybe you just needed to wait for it to finish - seems to have run now.

Screenshot 2023-03-03 at 16 07 59

If you click on this you can see which VMs it ran over; it seems to have covered all 5 VMs in the SRE without missing any.

Screenshot 2023-03-03 at 16 08 46

So it's possible that the query preview is not the same as what actually gets run? We can test this by changing another schedule that has been failing in the past.

edwardchalstrey1 commented 1 year ago

For dsggw, for example, the latest run looks like this, so I think you may be right that sbox2 isn't a good one to test this on:

Screenshot 2023-03-03 at 16 38 07

jemrobinson commented 1 year ago

OK, so can you update the timing on this one so it will run tonight? Just need to change start date from 28/02/2023 to 04/03/2023.

I think this might work since, looking here (https://learn.microsoft.com/en-us/azure/automation/troubleshoot/update-management?WT.mc_id=Portal-Microsoft_Azure_Automation#nologs), I can see that in this particular case it looks like the problem is that the update job simply hasn't run recently.

Screenshot 2023-03-03 at 17 35 29

jemrobinson commented 1 year ago

Here's a VM that I think is not working and it seems to be due to an issue with the update agent.

Screenshot 2023-03-03 at 17 33 11

edwardchalstrey1 commented 1 year ago

Looks like the Hybrid Runbook Worker is OK on this VM. I'll look at the steps to fix multihoming (which we identified earlier, since the VM had an extra Log Analytics workspace), but maybe we just need to reinstall "OMS-Agent-for-Linux" now that you have deleted the extra Log Analytics workspace, @jemrobinson.

I guess we expect the internet connectivity check to fail here, as it's tier 3.

Screenshot 2023-03-06 at 09 45 06

jemrobinson commented 1 year ago

@edwardchalstrey1 Did fixing multihoming help? If not, did you try reinstalling the OMS agent?

JimMadge commented 9 months ago

I have encountered the same bug. A VM reports being connected to the Log Analytics Workspace, and the Automation Account is connected to the Log Analytics Workspace. However, the Automation Account does not apply updates.

After troubleshooting as Ed did above, the multihoming check showed as failed again. The machine should not be multihomed (connected to more than one Log Analytics Workspace).

JimMadge commented 9 months ago

A possible workaround is to check for this and redeploy.

craddm commented 9 months ago

An additional point: adding an SRD manually using Add_Single_SRD.ps1 doesn't enable automatic updates or install the OMS agent, so we need to document that Setup_SRE_Monitoring.ps1 should also be run after adding an SRD.

craddm commented 9 months ago

On a newly deployed SRD, there are what seem to be vestigial files for omsagent in /var/opt/microsoft/omsagent/.

image

This ID is the spurious DefaultWorkspace ID, and that is where the multihoming issue comes from. There is already some kind of record of a workspace showing there, even though the portal shows that there are no extensions installed etc. It is possible, from inspecting the logs of this omsagent, that this is something MS have running during the deployment of the VM, which never completes because there is no internet access.

image

So I'm wondering if this is supposed to be deleted once the setup process is complete, but never is.
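The check described above (looking for leftover workspace records under /var/opt/microsoft/omsagent/) could be scripted roughly as follows. This is only a sketch: it assumes each workspace the agent knows about appears as a subdirectory named after the workspace ID (a GUID), which matches what was seen on the SRD here but has not been verified against other agent versions. The function names `find_workspace_ids` and `is_multihomed` are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: one subdirectory per known workspace, named after its GUID.
GUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def find_workspace_ids(state_dir="/var/opt/microsoft/omsagent"):
    """Return the sorted GUID-named subdirectories found in state_dir."""
    root = Path(state_dir)
    if not root.is_dir():
        # No agent state at all: nothing to report.
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and GUID_RE.match(p.name)
    )


def is_multihomed(state_dir="/var/opt/microsoft/omsagent"):
    """True if more than one workspace ID is recorded in state_dir."""
    return len(find_workspace_ids(state_dir)) > 1


if __name__ == "__main__":
    ids = find_workspace_ids()
    print(f"Workspace IDs found: {ids}")
    print(f"Multihomed: {len(ids) > 1}")
```

Running this on a freshly deployed SRD before the monitoring setup would show whether a spurious workspace ID is already present; any ID other than the intended workspace is a candidate for the DefaultWorkspace leftover.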

jemrobinson commented 9 months ago

Great detective work @craddm! Could be a problem that occurs during the image building process? Might be worth adding something to the deployment-time cloud-init that deletes the /var/opt/microsoft/omsagent/ directory and seeing if that helps?

JimMadge commented 9 months ago

Perhaps the agent gets installed (or there is an attempt) during the build of the SRD image? Ensuring those files are deleted, if they exist, in cloud-init could work as long as that happens before the agent gets installed?

craddm commented 9 months ago

Can confirm that the agent is there during the SRD image build

JimMadge commented 9 months ago

Great, I think the obvious thing to try is a step at the end of the build to rm -rf all of that.
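For reference, the cleanup step suggested above might look something like this if done from Python rather than a plain rm -rf. It is a sketch only: the thread has not confirmed that removing this directory is enough to stop the multihoming problem, and the function name `remove_agent_state` is hypothetical. It would need to run as root, at the end of the image build and before the agent is installed at deployment time.

```python
import shutil
from pathlib import Path


def remove_agent_state(state_dir="/var/opt/microsoft/omsagent"):
    """Delete state_dir if it exists. Returns True if something was removed."""
    path = Path(state_dir)
    if path.is_symlink():
        # rmtree refuses to operate on symlinks, so remove the link itself.
        path.unlink()
        return True
    if not path.exists():
        # Nothing left over from the build: safe to call unconditionally.
        return False
    shutil.rmtree(path)
    return True
```

Returning a boolean makes it easy to log whether the build actually left anything behind, which would help confirm whether the image build is the source of the vestigial files.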