alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Update management is not running for VMs of recently deployed SREs (except Guacamole) #1403

Closed edwardchalstrey1 closed 7 months ago

edwardchalstrey1 commented 1 year ago

:white_check_mark: Checklist

:computer: System information

:cactus: Powershell module versions

2023-02-24 09:58:58 [SUCCESS]: [✔] Powershell version: 7.2.6
2023-02-24 09:58:58 [SUCCESS]: [✔] Poshstache module version: 0.1.10
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.KeyVault module version: 4.6.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Accounts module version: 2.9.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Microsoft.Graph.Identity.DirectoryManagement module version: 1.10.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Microsoft.Graph.Authentication module version: 1.17.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Storage module version: 4.7.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Network module version: 4.18.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.PrivateDns module version: 1.0.3
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.RecoveryServices module version: 5.4.1
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Compute module version: 4.29.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Resources module version: 6.0.1
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Automation module version: 1.7.3
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Dns module version: 1.1.2
2023-02-24 09:58:58 [SUCCESS]: [✔] Powershell-Yaml module version: 0.4.2
2023-02-24 09:58:58 [SUCCESS]: [✔] Microsoft.Graph.Applications module version: 1.17.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.MonitoringSolutions module version: 0.1.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.OperationalInsights module version: 3.1.0
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.Monitor module version: 3.0.1
2023-02-24 09:58:58 [SUCCESS]: [✔] Az.DataProtection module version: 0.4.0

:no_entry_sign: Describe the problem

For recently deployed SREs (e.g. dsgkeep), Update management is working only for Guacamole and not for the other VMs, including the compute VM. As a result, packages are not being updated; for example, this caused the problem reported in #1401.

Screenshot 2023-02-24 at 09 59 57

This is strange because it looks like the VM is connected to loganalytics:

Screenshot 2023-02-24 at 10 40 45

:steam_locomotive: Workarounds or solutions

jemrobinson commented 1 year ago

Can you go to Operations > Updates in the Portal on an affected VM and click Go to Updates using Automation?

Screenshot 2023-02-24 at 15 31 19

then click Enable

Screenshot 2023-02-24 at 15 18 31

this ought to (after a few minutes) trigger discovery of the VM in the Azure Query that wasn't working in the screenshots above.

edwardchalstrey1 commented 1 year ago

I'm unable to see the updates when I go to Operations > Updates Screenshot 2023-02-24 at 15 48 52

But I managed to find this page by going Update management center > Machines and clicking on the VM: Screenshot 2023-02-24 at 15 51 16

Setting Enable and clicking Save resulted in this error: Screenshot 2023-02-24 at 15 51 48

jemrobinson commented 1 year ago

NB. running Setup_SRE_Monitoring.ps1 -shmID <shm name> -sreId <sre name> fixed the issue on a test deployment.

jemrobinson commented 1 year ago

I'm unable to see the updates when I go to Operations > Updates

You have to click "Leave new experience" in the top bar.

Do NOT enable Update management center (preview) as this is a new Update Management solution that hasn't yet been fully rolled out.

edwardchalstrey1 commented 1 year ago

Screenshot 2023-02-24 at 16 24 49

jemrobinson commented 1 year ago

As mentioned above, do NOT enable Update management center (preview), as this is a new Update Management solution that hasn't yet been fully rolled out.

edwardchalstrey1 commented 1 year ago

The problem appears to be that the query used by the automation updates to find VMs doesn't find them (apart from the Guacamole ones, for some reason).

JimMadge commented 1 year ago

@edwardchalstrey1 If this is similar to the issue we had previously with backups, where permissions (through role assignments) need to be made between subscriptions, you might want to look at the issue and the PR.

edwardchalstrey1 commented 1 year ago

Here are the docs on how update management should work:
https://learn.microsoft.com/en-us/azure/automation/update-management/overview
https://learn.microsoft.com/en-us/azure/automation/update-management/plan-deployment

edwardchalstrey1 commented 1 year ago

NB. running Setup_SRE_Monitoring.ps1 -shmID <shm name> -sreId <sre name> fixed the issue on a test deployment.

Note, this doesn't fix the issue for "Sandbox" - despite this SRE being in the same subscription as the SHM ("prod4")

edwardchalstrey1 commented 1 year ago

The problem appears to be that the search query of a scheduled deployment doesn't find all the VMs: Screenshot 2023-02-27 at 14 23 32

Just Guacamole (although weirdly, when I hit refresh it occasionally finds other VMs 😱 ): Screenshot 2023-02-27 at 14 23 55

then this...

Screenshot 2023-02-27 at 14 34 33

I also tried enabling VM insights to see if this would result in the VM I added it to showing up in the above list but it didn't

edwardchalstrey1 commented 1 year ago

Running ./Setup_SRE_Monitoring.ps1 for Sandbox I noticed this:

2023-02-27 14:33:56 [SUCCESS]: [✔] Retrieved log analytics workspace 'shm-prod4-loganalytics.
2023-02-27 14:33:56 [   INFO]: [ ] Ensuring logging agent is installed on all SRE VMs...

At first I thought this wasn't followed by either "Ensured that logging agent is installed on all SRE VMs." or "Failed to ensure that logging agent is installed on all SRE VMs!", but it was:

2023-02-27 14:34:00 [SUCCESS]: [✔] Ensured that logging agent is installed on all SRE VMs.

and the Log Analytics agent for Linux, called OmsAgentForLinux, does appear to be present on all VMs: Screenshot 2023-02-27 at 15 23 26

edwardchalstrey1 commented 1 year ago

Running the troubleshoot check on the failure in this linux-updates “deployment run” shows one failure:

Screenshot 2023-02-27 at 15 41 09

Problem with multihoming Screenshot 2023-02-27 at 15 43 25

jemrobinson commented 1 year ago

Good spot. What are the two Log Analytics Workspaces it's registered with?

edwardchalstrey1 commented 1 year ago

But for other compute VMs, none of the updates run because the AzureQuery provided in the deployment schedule doesn't find the VMs:

Screenshot 2023-02-27 at 15 54 17

jemrobinson commented 1 year ago

I think this is where I got to with Ed on Friday. The AzureQuery should be able to see all VMs that are registered with the correct LogAnalyticsWorkspace, but it doesn't seem to be finding them.

edwardchalstrey1 commented 1 year ago

Good spot. What are the two Log Analytics Workspaces it's registered with?

See the screenshot 4 comments up: there is AzureMonitorLinuxAgent in addition to OmsAgentForLinux. However, when I delete both from the Sandbox compute VM and re-run ./Setup_SRE_Monitoring.ps1, only OmsAgentForLinux is reinstalled, so I may have added AzureMonitorLinuxAgent manually earlier today by mistake. (The Sandbox linux-updates deployment run 3 comments up was from 21st Feb, so it might be unrelated.)

Looking at Extensions + applications now, I only see the one Log Analytics agent (are "agents" the same as "Workspaces"?)

Screenshot 2023-02-27 at 16 01 41

jemrobinson commented 1 year ago

The VM extensions are small scripts that add capabilities to the VMs. In this case AzureMonitorLinuxAgent links a VM to an Azure Monitor resource and OmsAgentForLinux links a Linux VM to a LogAnalyticsWorkspace.

However, that error message gives the IDs for two LogAnalyticsWorkspaces - what are their names? One of them is the one that the VM is supposed to be connected to (see screenshot) but what is the second one?

Screenshot 2023-02-27 at 16 12 32

edwardchalstrey1 commented 1 year ago

The error gives these two IDs: ['5a7c5978-6c3f-4e2e-9f6a-a8063cf82914', 'c172386f-1bcb-48b2-b1e5-b937dc24fd3c']. Let me check if I can see these.

Neither of them is the same as 4aea9c2f-9b6c-42e8-8b09-3594994fe238 in your screenshot above @jemrobinson

edwardchalstrey1 commented 1 year ago

I can't see either of those in this list: Screenshot 2023-02-27 at 16 40 27

jemrobinson commented 1 year ago

See screenshot I attached above. You need to look at each LogAnalyticsWorkspace and look at "Workspace ID" in the "Overview" tab.

Note: 4aea9c2f-9b6c-42e8-8b09-3594994fe238 is the subscription ID not the workspace ID.

edwardchalstrey1 commented 1 year ago

See screenshot I attached above. You need to look at each LogAnalyticsWorkspace and look at "Workspace ID" in the "Overview" tab.

Note: 4aea9c2f-9b6c-42e8-8b09-3594994fe238 is the subscription ID not the workspace ID.

5a7c5978-6c3f-4e2e-9f6a-a8063cf82914 is shm-prod4-loganalytics
c172386f-1bcb-48b2-b1e5-b937dc24fd3c is DefaultWorkspace-f871c3f7-6a68-42fb-bed6-81689e730f7a-SUK

Screenshot 2023-02-27 at 16 44 29
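
For reference, the cross-check above can be written down as a small lookup. This is only a minimal Python sketch, not something in the repository: the IDs and workspace names are the ones quoted in this thread, while the `expected` value and the helper name `unexpected_workspaces` are assumptions made up for illustration.

```python
# Workspace ID -> workspace name, as read from each workspace's "Overview" tab.
workspaces = {
    "5a7c5978-6c3f-4e2e-9f6a-a8063cf82914": "shm-prod4-loganalytics",
    "c172386f-1bcb-48b2-b1e5-b937dc24fd3c": "DefaultWorkspace-f871c3f7-6a68-42fb-bed6-81689e730f7a-SUK",
}

# IDs listed in the multihoming error for the affected VM.
reported_ids = [
    "5a7c5978-6c3f-4e2e-9f6a-a8063cf82914",
    "c172386f-1bcb-48b2-b1e5-b937dc24fd3c",
]

# Assumption: the VM should report only to the SHM workspace.
expected = "shm-prod4-loganalytics"


def unexpected_workspaces(ids, known, expected_name):
    """Return names (or the raw ID, if unknown) of workspaces other than the expected one."""
    return [known.get(i, i) for i in ids if known.get(i) != expected_name]


extra = unexpected_workspaces(reported_ids, workspaces, expected)
print(extra)
```

Anything this returns is a second workspace the VM is registered with, which is what the multihoming check complains about. An ID that is not in the lookup (for example a subscription ID pasted by mistake) comes back unchanged rather than being silently dropped.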

jemrobinson commented 1 year ago

Great. I don't think DefaultWorkspace-f871c3f7-6a68-42fb-bed6-81689e730f7a-SUK is being used, so I'll delete that now. Can you see whether this resolves any of the other issues?

jemrobinson commented 1 year ago

Also, there is a Microsoft-provided diagnostic tool here: https://learn.microsoft.com/en-us/azure/azure-monitor/agents/agent-linux-troubleshoot?WT.mc_id=Portal-Microsoft_Azure_Support#log-analytics-troubleshooting-tool

jemrobinson commented 1 year ago

This might also be a problem with the Hybrid Runbook Worker and there are troubleshooting steps here.

NB. I'm sure you're already aware, but please do all potentially destructive testing on a dev environment like SRE Sandbox.

edwardchalstrey1 commented 1 year ago

Having another look at this now - another thing to note is that the automated update management still tries to run for SREs that were torn down, which also results in a lot of failures:

Screenshot 2023-03-01 at 11 52 15

jemrobinson commented 1 year ago

Having another look at this now - another thing to note is that the automated update management still tries to run for SREs that were torn down, which also results in a lot of failures:

Yes, we should add "delete the automation schedules" to the SRE teardown script. If you'd like, you can manually delete them for the moment?

edwardchalstrey1 commented 1 year ago

Log Analytics shows that all VMs for a given SRE are present:

Screenshot 2023-03-01 at 12 05 40

Yet editing the update deployment Azure Query, the VMs that are found are inconsistent depending on the SRE subscription:

Screenshot 2023-03-01 at 12 07 51 Screenshot 2023-03-01 at 12 09 31

For the SREs that aren't Sandbox, none of the updates run, even for the VMs found by the Azure Query:

Screenshot 2023-03-01 at 12 12 07

jemrobinson commented 1 year ago

For the SREs that aren't Sandbox, none of the updates run, even for the VMs found by the Azure Query:

There seems to be some inconsistency between what's shown in the AzureQuery preview and what's actually found when the query runs. In the screenshot you've attached here the AzureQuery isn't finding any VMs. Is there a case where VMs are found but the updates aren't running?

edwardchalstrey1 commented 1 year ago

I'm starting to think there might be something fundamentally broken in the Azure Query system, and we should think about raising an issue with Microsoft. Whilst clicking around on the New update deployment page, I briefly "tricked" it into realising that all the VMs of a particular SRE were present (I think by selecting both the SRE and SHM subscriptions, then un-selecting the SHM subscription)... yet when I try to reproduce this I am unable to do so!

Screenshot 2023-03-02 at 09 48 11

edwardchalstrey1 commented 1 year ago

Preview vs edit of the same query:

Screenshot 2023-03-02 at 10 07 55 Screenshot 2023-03-02 at 10 07 40

jemrobinson commented 1 year ago

Have you run through the Hybrid Runbook Worker troubleshooting I linked above? VMs with a correctly configured Hybrid Runbook Worker seem to be working correctly.

VM where updates are working

Screenshot 2023-03-03 at 12 05 37

VM where updates are not working

Screenshot 2023-03-03 at 12 05 40

EC note: when I open this one, it looks like it is working

Screenshot 2023-03-03 at 13 45 26

edwardchalstrey1 commented 1 year ago

Looking at the troubleshooting checklist it looks like system-assigned managed identity is not enabled on our VMs:

Screenshot 2023-03-03 at 13 31 58

edwardchalstrey1 commented 1 year ago

Troubleshoot VM extension-based Hybrid Runbook Worker issues in Automation

jemrobinson commented 1 year ago

Interesting. Here's a VM where updates are working but there's no system-assigned identity.

Screenshot 2023-03-03 at 13 42 30 Screenshot 2023-03-03 at 13 45 14

edwardchalstrey1 commented 1 year ago

@jemrobinson I added a screenshot to your previous comment - it looks like when I click on the Edon one it shows 1 Hybrid Worker as well

jemrobinson commented 1 year ago

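Just to confirm, is it true that:

  • all VMs that this is working for show up when doing Log Analytics workspace > Agents Management > See them in logs
  • all VMs that this is not working for don't show up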

edwardchalstrey1 commented 1 year ago

Hybrid worker troubleshooter doesn't appear to be installed (should be Microsoft.Azure.Automation.HybridWorker.HybridWorkerForLinux-<version>/Troubleshooter/LinuxTroubleshooter.py):

sresbox2admin:~$ sudo ls /var/lib/waagent/
Certificates.p7m
Certificates.pem
Certificates.xml
DD8B981580F62B2080D6F8160CB402AE50BF90ED.crt
DD8B981580F62B2080D6F8160CB402AE50BF90ED.prv
events
event_status.json
GoalState.1.xml
history
Incarnation
initial_goal_state
logcollector
Microsoft.CPlat.Core.RunCommandLinux-1.0.5
Microsoft.CPlat.Core.RunCommandLinux__1.0.5.zip
Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.23
Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux__1.14.23.zip
ovf-env.xml
partition
Protocol
provisioned
published_hostname
run-command
SharedConfig.xml
TransportCert.pem
TransportPrivate.pem
waagent-network-setup.py
WALinuxAgent-2.7.3.0
WALinuxAgent-2.7.3.0.zip
WALinuxAgent-2.8.0.11
WALinuxAgent-2.8.0.11.zip
WALinuxAgent-2.9.0.4
WALinuxAgent-2.9.0.4.zip
WireServerEndpoint

jemrobinson commented 1 year ago

Hybrid worker troubleshooter doesn't appear to be installed

... and for a VM where updates are working it is installed?

edwardchalstrey1 commented 1 year ago

... and for a VM where updates are working it is installed?

No Screenshot 2023-03-03 at 13 57 23

edwardchalstrey1 commented 1 year ago

Just to confirm, is it true that:

  • all VMs that this is working for show up when doing Log Analytics workspace > Agents Management > See them in logs
  • all VMs that this is not working for don't show up

@jemrobinson where should I be seeing them here?

Screenshot 2023-03-03 at 14 00 05

jemrobinson commented 1 year ago

It's in "Agents Management" not "Legacy agents management"

Screenshot 2023-03-03 at 14 03 08

or you can just click on "Logs" and enter this query directly:

Heartbeat
| where OSType == 'Linux'
| summarize arg_max(TimeGenerated, *) by SourceComputerId
| sort by Computer
| render table
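
In case the KQL is unfamiliar: the `summarize arg_max(TimeGenerated, *) by SourceComputerId` step keeps only the most recent heartbeat row for each machine. Below is a rough Python sketch of the same logic. The rows are made up (the IDs, computer names and timestamps are not from our workspace) and the function name `latest_linux_heartbeats` is just for illustration.

```python
from datetime import datetime

# Made-up rows standing in for the Heartbeat table; only the columns the query uses.
rows = [
    {"SourceComputerId": "id-1", "Computer": "vm-alpha", "OSType": "Linux",
     "TimeGenerated": datetime(2023, 3, 3, 13, 0)},
    {"SourceComputerId": "id-1", "Computer": "vm-alpha", "OSType": "Linux",
     "TimeGenerated": datetime(2023, 3, 3, 14, 0)},
    {"SourceComputerId": "id-2", "Computer": "vm-beta", "OSType": "Linux",
     "TimeGenerated": datetime(2023, 3, 3, 13, 30)},
    {"SourceComputerId": "id-3", "Computer": "vm-gamma", "OSType": "Windows",
     "TimeGenerated": datetime(2023, 3, 3, 13, 45)},
]


def latest_linux_heartbeats(rows):
    """Keep the newest row per SourceComputerId for Linux machines only."""
    latest = {}
    for row in rows:
        if row["OSType"] != "Linux":  # where OSType == 'Linux'
            continue
        key = row["SourceComputerId"]
        # arg_max(TimeGenerated, *): remember the whole row with the largest timestamp
        if key not in latest or row["TimeGenerated"] > latest[key]["TimeGenerated"]:
            latest[key] = row
    # KQL's sort operator defaults to descending order
    return sorted(latest.values(), key=lambda r: r["Computer"], reverse=True)


result = latest_linux_heartbeats(rows)
for r in result:
    print(r["Computer"], r["TimeGenerated"])
```

So a VM shows up in the output exactly once if it has sent at least one heartbeat to the workspace in the selected time range, and not at all otherwise.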

edwardchalstrey1 commented 1 year ago

weird that for you it's called "Agents management" and just "Agents" on mine

edwardchalstrey1 commented 1 year ago

Just to confirm, is it true that:

  • all VMs that this is working for show up when doing Log Analytics workspace > Agents Management > See them in logs
  • all VMs that this is not working for don't show up

Looks this way yes

jemrobinson commented 1 year ago

Just to confirm, is it true that:

  • all VMs that this is working for show up when doing Log Analytics workspace > Agents Management > See them in logs
  • all VMs that this is not working for don't show up

Looks this way yes

OK, so can we try disconnecting a VM from LogAnalytics and then reconnecting it? I think this will require:

edwardchalstrey1 commented 1 year ago

Ok sure, it just so happens I have just deployed another SRE that we can mess with (sbox2):

jemrobinson commented 1 year ago

OK, but let's confirm that the VMs in this SRE are not working before trying to fix them :)

edwardchalstrey1 commented 1 year ago

OK, but let's confirm that the VMs in this SRE are not working before trying to fix them :)

Can I double check what we mean by "not working"?

For sbox2 we have the following:

Hybrid worker groups

  1. Guacamole VM
  2. Compute VM

Screenshot 2023-03-03 at 14 26 08

Logs

Looks like just Guacamole

Screenshot 2023-03-03 at 14 24 27

Azure Query finds...

  1. Guacamole
  2. Cocalc

Screenshot 2023-03-03 at 14 25 16
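
To make the mismatch easier to see, here is the same information as a quick Python sketch. The VM sets are exactly what I listed above for sbox2 (with "CoCalc" spelled consistently); the variable names are only for illustration.

```python
# Which VMs each view reports for sbox2, as listed above.
views = {
    "Hybrid worker groups": {"Guacamole", "Compute"},
    "Heartbeat logs": {"Guacamole"},
    "Azure Query": {"Guacamole", "CoCalc"},
}

# Every VM seen by at least one view, and the VMs seen by all of them.
all_vms = sorted(set().union(*views.values()))
in_every_view = set.intersection(*views.values())

# For each view, the VMs that some other view found but this one did not.
missing = {name: sorted(set(all_vms) - found) for name, found in views.items()}

print("seen in every view:", sorted(in_every_view))
for name, vms in missing.items():
    print(f"{name} is missing: {vms}")
```

Only Guacamole is reported consistently by all three views.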

jemrobinson commented 1 year ago

You're not seeing all the VMs (since CoCalc and CodiMD don't have SBOX2 in their name). Try adding a filter on the resource group, e.g.

Heartbeat
| where OSType == 'Linux'
| where Category != 'Azure Monitor Agent'
| where ResourceGroup contains 'sbox2'
| summarize arg_max(TimeGenerated, *) by SourceComputerId
| sort by Computer
| render table

and you should see

The same thing will also apply to the Hybrid Worker Groups.
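
To illustrate why the filter matters, here is a small Python sketch comparing a match on the computer name with a match on the resource group. The computer and resource group names below are invented for the example (the real ones will differ); the `contains` helper mimics the case-insensitive behaviour of KQL's `contains` operator.

```python
# Hypothetical heartbeat records; real computer and resource group names will differ.
heartbeats = [
    {"Computer": "GUACAMOLE-SRE-SBOX2", "ResourceGroup": "RG_SRE_SBOX2_REMOTE_DESKTOP"},
    {"Computer": "COCALC", "ResourceGroup": "RG_SRE_SBOX2_WEBAPPS"},
    {"Computer": "CODIMD", "ResourceGroup": "RG_SRE_SBOX2_WEBAPPS"},
    {"Computer": "COCALC", "ResourceGroup": "RG_SRE_OTHER_WEBAPPS"},
]


def contains(text, fragment):
    """Case-insensitive substring match, like KQL's `contains`."""
    return fragment.lower() in text.lower()


# Matching on the computer name misses VMs whose names don't mention the SRE.
by_name = [h["Computer"] for h in heartbeats if contains(h["Computer"], "sbox2")]

# Matching on the resource group picks up every VM deployed for that SRE.
by_group = [h["Computer"] for h in heartbeats if contains(h["ResourceGroup"], "sbox2")]

print(by_name)
print(by_group)
```

With these example records, the name match returns only the Guacamole VM, while the resource group match also returns the CoCalc and CodiMD VMs of the same SRE and leaves out the CoCalc VM that belongs to a different one.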

edwardchalstrey1 commented 1 year ago

Ok so that's true, and it's the same case for other SREs where the update management isn't working (all of them I thought), so this SRE (sbox2) counts as "not working" I assume?

query only finds guacamole:

Screenshot 2023-03-03 at 14 40 08

or does it...! Screenshot 2023-03-03 at 14 47 32