Closed edwardchalstrey1 closed 8 months ago
Can you go to Operations > Updates in the Portal on an affected VM, click "Go to Updates using Automation" and then click Enable? This ought to (after a few minutes) trigger discovery of the VM in the Azure Query that wasn't working in the screenshots above.
I'm unable to see the updates when I go to Operations > Updates
But I managed to find this page by going to Update management center > Machines and clicking on the VM:
Setting Enable and clicking Save resulted in this error:
NB. running Setup_SRE_Monitoring.ps1 -shmID <shm name> -sreId <sre name> fixed the issue on a test deployment.
I'm unable to see the updates when I go to Operations > Updates
You have to click "Leave new experience" in the top bar.
Do NOT enable Update management center (preview) as this is a new Update Management solution that hasn't yet been fully rolled out.
As mentioned above Do NOT enable Update management center (preview) as this is a new Update Management solution that hasn't yet been fully rolled out.
The problem appears to be that the query to find VMs in the automation updates doesn't find the VMs (apart from the Guacamole ones for some reason).
@edwardchalstrey1 If this is similar to the issue we had previously with backups, where permissions (through role assignments) need to be made between subscriptions, you might want to look at the issue and the PR.
Here are the docs on how update management should work: https://learn.microsoft.com/en-us/azure/automation/update-management/overview https://learn.microsoft.com/en-us/azure/automation/update-management/plan-deployment
NB. running Setup_SRE_Monitoring.ps1 -shmID <shm name> -sreId <sre name> fixed the issue on a test deployment.
Note, this doesn't fix the issue for "Sandbox" - despite this SRE being in the same subscription as the SHM ("prod4")
The problem appears to be that the search query of a scheduled deployment doesn't find all the VMs:
Just Guacamole (although weirdly when I hit refresh it occasionally finds other VMs 😱 ):
then this...
I also tried enabling VM insights to see if this would result in the VM I added it to showing up in the above list but it didn't
Running ./Setup_SRE_Monitoring.ps1 for Sandbox I noticed this:
2023-02-27 14:33:56 [SUCCESS]: [✔] Retrieved log analytics workspace 'shm-prod4-loganalytics.
2023-02-27 14:33:56 [ INFO]: [ ] Ensuring logging agent is installed on all SRE VMs...
At first I thought this wasn't followed by "Ensured that logging agent is installed on all SRE VMs." or "Failed to ensure that logging agent is installed on all SRE VMs!", but it was:
2023-02-27 14:34:00 [SUCCESS]: [✔] Ensured that logging agent is installed on all SRE VMs.
and the Log Analytics agent for Linux, called OmsAgentForLinux, does appear to be present for all VMs:
Running the troubleshoot check on the failure in this linux-updates "deployment run" shows one failure:
Problem with multihoming
Good spot. What are the two Log Analytics Workspaces it's registered with?
But for other compute VMs, none of the updates run because the AzureQuery provided in the deployment schedule doesn't find the VMs:
I think this is where I got to with Ed on Friday. The AzureQuery should be able to see all VMs that are registered with the correct LogAnalyticsWorkspace, but it doesn't seem to be finding them.
Good spot. What are the two Log Analytics Workspaces it's registered with?
See the screenshot 4 comments up: there is AzureMonitorLinuxAgent in addition to OmsAgentForLinux. When I delete these from the Sandbox compute VM and re-run ./Setup_SRE_Monitoring.ps1 it doesn't have this, it only has OmsAgentForLinux, so I may have added that manually earlier today by mistake (the Sandbox linux updates deployment run 3 comments up was from 21st Feb, so might be unrelated).
Looking at Extensions + applications now I only see the one Log Analytics agent (are "agents" the same as "Workspace"?)
The VM extensions are small scripts that add capabilities to the VMs. In this case AzureMonitorLinuxAgent links a VM to an Azure Monitor resource and OmsAgentForLinux links a Linux VM to a LogAnalyticsWorkspace.
However, that error message gives the IDs for two LogAnalyticsWorkspaces - what are their names? One of them is the one that the VM is supposed to be connected to (see screenshot) but what is the second one?
(['5a7c5978-6c3f-4e2e-9f6a-a8063cf82914', 'c172386f-1bcb-48b2-b1e5-b937dc24fd3c']) - let me check if I can see these.
Neither of them is the same as 4aea9c2f-9b6c-42e8-8b09-3594994fe238 in your screenshot above @jemrobinson
I can't see either of those in this list:
See screenshot I attached above. You need to look at each LogAnalyticsWorkspace and look at "Workspace ID" in the "Overview" tab.
Note: 4aea9c2f-9b6c-42e8-8b09-3594994fe238 is the subscription ID, not the workspace ID.
5a7c5978-6c3f-4e2e-9f6a-a8063cf82914 is shm-prod4-loganalytics
c172386f-1bcb-48b2-b1e5-b937dc24fd3c is DefaultWorkspace-f871c3f7-6a68-42fb-bed6-81689e730f7a-SUK
Great. I don't think DefaultWorkspace-f871c3f7-6a68-42fb-bed6-81689e730f7a-SUK is being used, so I'll delete that now. Can you see whether this resolves any of the other issues?
Also, there is a Microsoft-provided diagnostic tool here: https://learn.microsoft.com/en-us/azure/azure-monitor/agents/agent-linux-troubleshoot?WT.mc_id=Portal-Microsoft_Azure_Support#log-analytics-troubleshooting-tool
This might also be a problem with the Hybrid Runbook Worker and there are troubleshooting steps here.
NB. I'm sure you're already aware, but please do all potentially destructive testing on a dev environment like SRE Sandbox.
Having another look at this now - another thing to note is that the automated update management still tries to run for SREs that were torn down, which also results in a lot of failures:
Yes, we should add "delete the automation schedules" to the SRE teardown script. If you'd like, you can manually delete them for the moment?
Log Analytics shows that all VMs for a given SRE are present:
Yet editing the update deployment Azure Query, the VMs that are found are inconsistent depending on the SRE subscription:
For the SREs that aren't Sandbox, none of the updates run, even for the VMs found by the Azure Query:
There seems to be some inconsistency between what's shown in the AzureQuery preview and what's actually found when the query runs. In the screenshot you've attached here the AzureQuery isn't finding any VMs. Is there a case where VMs are found but the updates aren't running?
I'm starting to think there might be something fundamentally broken in the Azure Query system, and we should think about raising an issue with Microsoft. Whilst clicking around on the New update deployment page, I briefly "tricked" it into realising that all the VMs of a particular SRE were present (I think by selecting both the SRE and SHM subscription, then un-selecting the SHM subscription)... yet when I try to repeat this I am unable to do so!
Have you run through the Hybrid Runbook Worker troubleshooting I linked above? VMs with a correctly configured Hybrid Runbook Worker seem to be working correctly.
Looking at the troubleshooting checklist it looks like system-assigned managed identity is not enabled on our VMs:
[x] Check the OS is supported and the prerequisites have been met. See Prerequisites.
[x] Check whether the system-assigned managed identity is enabled on the VM. Azure VMs and Arc enabled Azure Machines should be enabled with a system-assigned managed identity.
[ ] Check whether the extension is enabled with the right settings. Setting file should have right AutomationAccountURL. Cross-check the URL with Automation account property - AutomationHybridServiceUrl.
For windows: you can find the settings file at C:\Packages\Plugins\Microsoft.Azure.Automation.HybridWorker.HybridWorkerForWindows\
[ ] Check the error message shown in the Hybrid worker extension status/Detailed Status. It contains error message(s) and respective recommendation(s) to fix the issue.
[ ] Run the troubleshooter tool on the VM and it will generate an output file. Open the output file and verify the errors identified by the troubleshooter tool.
For windows: you can find the troubleshooter at C:\Packages\Plugins\Microsoft.Azure.Automation.HybridWorker.HybridWorkerForWindows\
[ ] Check whether the hybrid worker process is running.
For Windows: check the Hybrid Worker Service service. For Linux: check the hwd. service.
Collect logs:
For Windows: Run the log collector tool in C:\Packages\Plugins\Microsoft.Azure.Automation.HybridWorker.HybridWorkerForWindows\
Interesting. Here's a VM where updates are working but there's no system-assigned identity.
@jemrobinson I added a screenshot to your previous comment - looks like when I click on the Edon one it shows the 1 Hybrid Workers as well
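Just to confirm, is it true that:
- all VMs that this is working for show up when doing Log Analytics workspace > Agents Management > See them in logs
- all VMs that this is not working for don't show up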
Hybrid worker troubleshooter doesn't appear to be installed (should be Microsoft.Azure.Automation.HybridWorker.HybridWorkerForLinux-<version>/Troubleshooter/LinuxTroubleshooter.py):
sresbox2admin:~$ sudo ls /var/lib/waagent/
Certificates.p7m
Certificates.pem
Certificates.xml
DD8B981580F62B2080D6F8160CB402AE50BF90ED.crt
DD8B981580F62B2080D6F8160CB402AE50BF90ED.prv
events
event_status.json
GoalState.1.xml
history
Incarnation
initial_goal_state
logcollector
Microsoft.CPlat.Core.RunCommandLinux-1.0.5
Microsoft.CPlat.Core.RunCommandLinux__1.0.5.zip
Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.23
Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux__1.14.23.zip
ovf-env.xml
partition
Protocol
provisioned
published_hostname
run-command
SharedConfig.xml
TransportCert.pem
TransportPrivate.pem
waagent-network-setup.py
WALinuxAgent-2.7.3.0
WALinuxAgent-2.7.3.0.zip
WALinuxAgent-2.8.0.11
WALinuxAgent-2.8.0.11.zip
WALinuxAgent-2.9.0.4
WALinuxAgent-2.9.0.4.zip
WireServerEndpoint
Hybrid worker troubleshooter doesn't appear to be installed
... and for a VM where updates are working it is installed?
No
Just to confirm, is it true that:
- all VMs that this is working for show up when doing Log Analytics workspace > Agents Management > See them in logs
- all VMs that this is not working for don't show up
@jemrobinson where should I be seeing them here?
It's in "Agents Management" not "Legacy agents management"
or you can just click on "Logs" and enter this query directly:
Heartbeat
| where OSType == 'Linux'
| summarize arg_max(TimeGenerated, *) by SourceComputerId
| sort by Computer
| render table
weird that for you it's called "Agents management" and just "Agents" on mine
Just to confirm, is it true that:
- all VMs that this is working for show up when doing Log Analytics workspace > Agents Management > See them in logs
- all VMs that this is not working for don't show up
Looks this way yes
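The hypothesis being confirmed here can be checked mechanically once the two lists are written down. The following is a rough Python sketch; the function name `check_hypothesis` and all VM names are placeholders, not real machines from these deployments.

```python
def check_hypothesis(working, not_working, in_heartbeat):
    """Check that every VM with working updates has a heartbeat and no failing VM has one."""
    # Compare names case-insensitively, since computer names may differ in case.
    heartbeat = {name.lower() for name in in_heartbeat}
    working_missing = sorted(n for n in working if n.lower() not in heartbeat)
    failing_present = sorted(n for n in not_working if n.lower() in heartbeat)
    return {
        "working_but_no_heartbeat": working_missing,
        "failing_but_has_heartbeat": failing_present,
        "holds": not working_missing and not failing_present,
    }


# Placeholder names only: replace with the lists read from the Portal.
result = check_hypothesis(
    working=["guacamole-example"],
    not_working=["compute-example", "cocalc-example"],
    in_heartbeat=["GUACAMOLE-EXAMPLE", "COCALC-EXAMPLE"],
)
print(result)
```

A non-empty `failing_but_has_heartbeat` list means a VM is reporting to Log Analytics but still not being updated, in which case a missing heartbeat cannot be the whole explanation.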
OK, so can we try disconnecting and reconnecting a VM from LogAnalytics. I think this will require:
Setup_SRE_Monitoring.ps1
Ok sure, it just so happens I have just deployed another SRE that we can mess with (sbox2):
OK, but let's confirm that the VMs in this SRE are not working before trying to fix them :)
Can I double check what we mean by "not working"
For sbox2 we have the following:
Looks like just Guacamole
You're not seeing all the VMs (since CoCalc and CodiMD don't have SBOX2 in their name). Try adding a filter on the resource group eg.
Heartbeat
| where OSType == 'Linux'
| where Category != 'Azure Monitor Agent'
| where ResourceGroup contains 'sbox2'
| summarize arg_max(TimeGenerated, *) by SourceComputerId
| sort by Computer
| render table
and you should see
The same thing will also apply to the Hybrid Worker Groups.
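To make the difference between filtering on the computer name and filtering on the resource group concrete, here is a small Python sketch that mimics the Heartbeat query above on in-memory rows. The field names follow the query, but every row value (computer names, resource group names, timestamps) is invented for illustration, and `latest_heartbeats` is a hypothetical helper, not part of any Azure SDK.

```python
from datetime import datetime

# Hypothetical heartbeat rows; all values are made up for illustration.
rows = [
    {"SourceComputerId": "id-guac", "Computer": "GUACAMOLE-SRE-SBOX2", "ResourceGroup": "RG_SRE_SBOX2_REMOTE_DESKTOP",
     "OSType": "Linux", "Category": "Direct Agent", "TimeGenerated": datetime(2023, 3, 1, 10, 0)},
    {"SourceComputerId": "id-guac", "Computer": "GUACAMOLE-SRE-SBOX2", "ResourceGroup": "RG_SRE_SBOX2_REMOTE_DESKTOP",
     "OSType": "Linux", "Category": "Direct Agent", "TimeGenerated": datetime(2023, 3, 1, 10, 5)},
    {"SourceComputerId": "id-cocalc", "Computer": "COCALC", "ResourceGroup": "RG_SRE_SBOX2_WEBAPPS",
     "OSType": "Linux", "Category": "Direct Agent", "TimeGenerated": datetime(2023, 3, 1, 10, 1)},
    {"SourceComputerId": "id-codimd", "Computer": "CODIMD", "ResourceGroup": "RG_SRE_SBOX2_WEBAPPS",
     "OSType": "Linux", "Category": "Direct Agent", "TimeGenerated": datetime(2023, 3, 1, 10, 2)},
    {"SourceComputerId": "id-codimd", "Computer": "CODIMD", "ResourceGroup": "RG_SRE_SBOX2_WEBAPPS",
     "OSType": "Linux", "Category": "Azure Monitor Agent", "TimeGenerated": datetime(2023, 3, 1, 10, 6)},
    {"SourceComputerId": "id-other", "Computer": "GUACAMOLE-SRE-OTHER", "ResourceGroup": "RG_SRE_OTHER_REMOTE_DESKTOP",
     "OSType": "Linux", "Category": "Direct Agent", "TimeGenerated": datetime(2023, 3, 1, 10, 3)},
]


def latest_heartbeats(rows, resource_group_fragment):
    """Keep the newest matching row per SourceComputerId, like summarize arg_max(TimeGenerated, *)."""
    latest = {}
    for row in rows:
        if row["OSType"] != "Linux" or row["Category"] == "Azure Monitor Agent":
            continue
        # KQL 'contains' is case-insensitive, so compare lowercased strings.
        if resource_group_fragment.lower() not in row["ResourceGroup"].lower():
            continue
        current = latest.get(row["SourceComputerId"])
        if current is None or row["TimeGenerated"] > current["TimeGenerated"]:
            latest[row["SourceComputerId"]] = row
    # KQL 'sort by' defaults to descending order.
    return sorted(latest.values(), key=lambda r: r["Computer"], reverse=True)


by_name = sorted({r["Computer"] for r in rows if "sbox2" in r["Computer"].lower()})
by_group = [r["Computer"] for r in latest_heartbeats(rows, "sbox2")]
print("filter on computer name:", by_name)
print("filter on resource group:", by_group)
```

In this made-up data the name filter only finds the Guacamole VM, while the resource group filter also picks up the VMs whose names don't contain the SRE ID.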
Ok so that's true, and it's the same case for other SREs where the update management isn't working (all of them I thought), so this SRE (sbox2) counts as "not working" I assume?
query only finds guacamole:
or does it...!
:white_check_mark: Checklist
:computer: System information
:cactus: Powershell module versions
:no_entry_sign: Describe the problem
For recently deployed SREs (e.g. dsgkeep) the Update management is only working for Guacamole, but not the other VMs, including the compute VM. This has resulted in packages not being updated; for example, it caused the problem that resulted in #1401. This is strange because it looks like the VM is connected to loganalytics:
:steam_locomotive: Workarounds or solutions