edwardchalstrey1 closed this issue 8 months ago.
Can you change the scheduled time for `sre-sbox2-linux-updates` from the current value (every 7 days, starting next Tuesday at 02:02) to every 1 hour starting at 03/03/2023 15:00? That way it will run in 15 minutes' time and we can see which VMs it runs over.
Hmm, it didn't seem to run; will investigate. Oh, wrong date. It should run at 4pm now.
Weird, it said it was going to run at 4pm but it didn't.
Maybe you just needed to wait for it to finish; it seems to have run now.
If you click on this, you can see which VMs it ran over: it seems to have covered all 5 VMs in the SRE without missing any.
So is it possible that the problem is that the query preview does not match what actually gets run? We can test this by changing another schedule that has been failing in the past.
For `dsggw`, for example, the latest run looks like this, so I think you may be right that `sbox2` isn't a good one to test this on:
OK, can you update the timing on this one so that it will run tonight? You just need to change the start date from 28/02/2023 to 04/03/2023.
I think this might work: looking at the troubleshooting guide here (https://learn.microsoft.com/en-us/azure/automation/troubleshoot/update-management?WT.mc_id=Portal-Microsoft_Azure_Automation#nologs), it looks like in this particular case the problem is simply that the update job hasn't run recently.
Here's a VM that I think is not working, and the cause seems to be an issue with the update agent.
It looks like the Hybrid Runbook Worker is OK on this VM. I'll look at the steps to fix multihoming (we already identified earlier that this VM has the extra Log Analytics workspace), but maybe we just need to reinstall "OMS-Agent-for-Linux" now that you have deleted the extra Log Analytics workspace, @jemrobinson.
I guess we expect the internet connectivity check to fail here, as it's a Tier 3 SRE.
@edwardchalstrey1 Did fixing multihoming help? If not, did you try reinstalling the OMS agent?
I have encountered the same bug. A VM reports being connected to the Log Analytics workspace, and the Automation Account is also connected to that workspace; however, the Automation Account does not apply updates.
After troubleshooting as Ed did above, the multihoming check showed as failed again. The machine should not be multihomed (i.e. connected to more than one Log Analytics workspace).
A possible workaround is to check for this and redeploy.
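A minimal sketch of such a check, in Python for illustration only. It assumes that the OMS agent for Linux keeps per-workspace state in subdirectories of `/var/opt/microsoft/omsagent/` named after each workspace ID (a GUID), so finding more than one such directory suggests the machine is multihomed. The function names are hypothetical, and the redeploy step itself is not shown.

```python
import re
from pathlib import Path

# Assumption: the OMS agent for Linux keeps per-workspace state in
# subdirectories of this path, each named after a workspace ID (a GUID).
OMSAGENT_STATE_DIR = Path("/var/opt/microsoft/omsagent")

# Matches a GUID such as 11111111-2222-3333-4444-555555555555.
GUID_RE = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")


def workspace_ids(state_dir: Path = OMSAGENT_STATE_DIR) -> list[str]:
    """Return the sorted names of GUID-named subdirectories of state_dir."""
    if not state_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in state_dir.iterdir()
        if entry.is_dir() and GUID_RE.fullmatch(entry.name)
    )


def is_multihomed(state_dir: Path = OMSAGENT_STATE_DIR) -> bool:
    """True if the agent appears to hold state for more than one workspace."""
    return len(workspace_ids(state_dir)) > 1


def unexpected_workspaces(
    expected_id: str, state_dir: Path = OMSAGENT_STATE_DIR
) -> list[str]:
    """Workspace IDs found on disk other than the one the SRE should use."""
    return [ws for ws in workspace_ids(state_dir) if ws.lower() != expected_id.lower()]


if __name__ == "__main__":
    found = workspace_ids()
    print(f"Workspace directories found: {found}")
    if len(found) > 1:
        print("Possible multihoming: remove the extra workspace and redeploy.")
```

If the assumed layout holds, passing the SRE's own workspace ID to `unexpected_workspaces` would isolate any extra workspace, which is the one to remove before redeploying.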
An additional point: adding an SRD manually using `Add_Single_SRD.ps1` doesn't enable automatic updates or install the OMS agent, so we need to document that `Setup_SRE_Monitoring.ps1` should also be run after adding an SRD.
On a newly deployed SRD, there are what seem to be vestigial files for omsagent in `/var/opt/microsoft/omsagent/`.
This ID is the spurious DefaultWorkspace ID, and it is from there that the multihoming issue arises. There is already some kind of record of a workspace there, even though the portal shows that no extensions are installed. From inspecting the logs of this omsagent, it is possible that this is something Microsoft runs during the deployment of the VM, which never completes because there is no internet access.
So I'm wondering if this is supposed to be deleted once the setup process is complete, but never is.
Great detective work, @craddm! Could this be a problem that occurs during the image-building process? It might be worth adding something to the deployment-time cloud-init that deletes the `/var/opt/microsoft/omsagent/` directory and seeing if that helps.
Perhaps the agent gets installed (or an installation is attempted) during the build of the SRD image? Deleting those files in cloud-init, if they exist, could work as long as that happens before the agent gets installed.
I can confirm that the agent is present during the SRD image build.
Great. I think the obvious thing to try is a step at the end of the build that runs `rm -rf` on all of that.
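A minimal sketch of that clean-up step, written in Python for illustration (the thread itself proposes a plain `rm -rf` at the end of the image build). It assumes the leftover state lives only under `/var/opt/microsoft/omsagent/`, that the step runs as root, and that it runs before any OMS agent installation on the deployed VM; the function name is hypothetical.

```python
import shutil
from pathlib import Path

# Assumption: this directory holds the vestigial OMS agent state seen on
# freshly built SRD images (including the spurious DefaultWorkspace ID).
OMSAGENT_STATE_DIR = Path("/var/opt/microsoft/omsagent")


def remove_vestigial_omsagent_state(state_dir: Path = OMSAGENT_STATE_DIR) -> bool:
    """Delete state_dir and everything under it (the equivalent of rm -rf).

    Returns True if something was removed and False if it was already
    absent, so the step is safe to run more than once.
    """
    if not state_dir.exists():
        return False
    if state_dir.is_dir() and not state_dir.is_symlink():
        shutil.rmtree(state_dir)
    else:
        # A plain file or a symlink: remove the entry itself, not its target.
        state_dir.unlink()
    return True


if __name__ == "__main__":
    removed = remove_vestigial_omsagent_state()
    print("Removed vestigial omsagent state" if removed else "Nothing to remove")
```

Making the step idempotent means it can sit at the end of the image build and, if needed, also in deployment-time cloud-init without failing when the directory is already gone.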
:white_check_mark: Checklist
:computer: System information
:cactus: Powershell module versions
:no_entry_sign: Describe the problem
For recently deployed SREs (e.g. `dsgkeep`), Update Management is only working for the Guacamole VM and not for the other VMs, including the compute VM. As a result, packages are not being updated; for example, this caused the problem that led to #1401.
This is strange because it looks like the VM is connected to Log Analytics:
:steam_locomotive: Workarounds or solutions