alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Failure to install extension 'OmsAgentForLinux' on SRD #1372

Status: Closed (edwardchalstrey1 closed this issue 8 months ago)

edwardchalstrey1 commented 1 year ago

:white_check_mark: Checklist

:computer: System information

:cactus: Powershell module versions

2023-02-01 13:20:47 [SUCCESS]: [✔] Powershell version: 7.2.6
2023-02-01 13:20:47 [SUCCESS]: [✔] Powershell-Yaml module version: 0.4.2
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Monitor module version: 3.0.1
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.RecoveryServices module version: 5.4.1
2023-02-01 13:20:47 [SUCCESS]: [✔] Microsoft.Graph.Applications module version: 1.17.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.OperationalInsights module version: 3.1.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Microsoft.Graph.Authentication module version: 1.17.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Compute module version: 4.29.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.MonitoringSolutions module version: 0.1.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Network module version: 4.18.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Automation module version: 1.7.3
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Dns module version: 1.1.2
2023-02-01 13:20:47 [SUCCESS]: [✔] Microsoft.Graph.Identity.DirectoryManagement module version: 1.10.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.PrivateDns module version: 1.0.3
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.DataProtection module version: 0.4.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Accounts module version: 2.9.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Poshstache module version: 0.1.10
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Resources module version: 6.0.1
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.Storage module version: 4.7.0
2023-02-01 13:20:47 [SUCCESS]: [✔] Az.KeyVault module version: 4.6.0

:no_entry_sign: Describe the problem

On deployment of an SRE, we sometimes get the below error message. This can be easily resolved by re-running the ./Setup_SRE_Monitoring.ps1 script (and any subsequent scripts run by ./Deploy_SRE.ps1 to be safe), but we don't know why this happens.

:deciduous_tree: Error message

2023-01-27 14:25:12 [   INFO]: [ ] Ensuring extension 'OmsAgentForLinux' is installed on VM 'SRE-DSGGW-160-SRD-20-04-2022112900'.
2023-01-27 14:31:17 [FAILURE]: [x] Failed to install extension 'OmsAgentForLinux' on VM 'SRE-DSGGW-160-SRD-20-04-2022112900'!
Exception: /Users/cmole/git_repos/data-safe-haven/deployment/secure_research_environment/setup/Setup_SRE_Monitoring.ps1:63:51
Line |
  63 |  … ForEach-Object { Get-AzVM -ResourceGroup $_.ResourceGroupName } | For …
     |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | Long running operation failed with status 'Failed'. Additional Info:'VM has reported a
     | failure when processing extension 'OmsAgentForLinux'. Error message: "Enable failed with exit
     | code 52 Couldn't create marker file"  More information on troubleshooting is available at
     | https://aka.ms/VMExtensionOMSAgentLinuxTroubleshoot ' ErrorCode: VMExtensionProvisioningError
     | ErrorMessage: VM has reported a failure when processing extension 'OmsAgentForLinux'. Error
     | message: "Enable failed with exit code 52 Couldn't create marker file"  More information on
     | troubleshooting is available at https://aka.ms/VMExtensionOMSAgentLinuxTroubleshoot 
     | ErrorTarget:  StartTime: 27/01/2023 14:30:16 EndTime: 27/01/2023 14:30:17 OperationID:
     | 6d4744bd-de24-4b13-a8a6-0d81c049f935 Status: Failed

:recycle: To reproduce

When running ./Deploy_SRE.ps1, the above sometimes happens during the call of ./Setup_SRE_Monitoring.ps1

Re-running ./Setup_SRE_Monitoring.ps1 works:

[Screenshot: 2023-01-30 at 09 29 57]

JimMadge commented 1 year ago

Possibly related https://github.com/Azure/azure-linux-extensions/issues/1116

https://github.com/Azure/azure-linux-extensions/issues/1116#issuecomment-637677864

> I haven't tested this out yet but I believe the issue I was running into may be because of the policy that automatically deployed the OMS Agent to applicable VMs in the subscription. I noticed that when looking at the VMSS instances not able to install extension due to error 52 (marker file) there was already an installation of the OMS agent on the machine and it was pointing to the default workspace. This I believe is because of the Azure policy.

jemrobinson commented 1 year ago

Can we update ./Setup_SRE_Monitoring.ps1 to:

JimMadge commented 1 year ago

It does seem like that would be a way around this as rerunning the same commands later seems to identify a misconfigured extension and handle it correctly.

JimMadge commented 1 year ago

Looks like we are already doing that in the called function...

https://github.com/alan-turing-institute/data-safe-haven/blob/aef639f5fd5204c66f0f6e4cb89c844100791ff5/deployment/common/AzureCompute.psm1#L470-L531

edwardchalstrey1 commented 1 year ago

> Looks like we are already doing that in the called function...
>
> https://github.com/alan-turing-institute/data-safe-haven/blob/aef639f5fd5204c66f0f6e4cb89c844100791ff5/deployment/common/AzureCompute.psm1#L470-L531

@JimMadge do you think it would be sensible to change line 497, `foreach ($i in 1..5) {`, from attempting the install just 5 times to either a much higher number of attempts or an unbounded loop (e.g. with a `while`)?

We could also move Setup_SRE_Monitoring.ps1 to be the last step in Deploy_SRE.ps1 (currently it's the penultimate step, before Setup_SRE_Backup.ps1).

JimMadge commented 1 year ago

No, I don't think so. It might just result in the script running for a very long time or indefinitely.

I'm not entirely convinced that more iterations of the loop will solve the problem. However, it would be worth testing out, with more iterations or a longer wait time between iterations.
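For anyone who wants to test the "more iterations or a longer wait" idea without risking an unbounded loop, here is a minimal sketch of a bounded retry with a growing delay. `Invoke-WithRetry` and its parameters are hypothetical names for illustration and are not part of `AzureCompute.psm1`; the `-Action` scriptblock stands in for whatever call installs the extension, and it is assumed to throw on failure (e.g. by using `-ErrorAction Stop`).

```powershell
# Hypothetical helper for illustration only; not part of the data-safe-haven codebase.
function Invoke-WithRetry {
    param(
        # Operation to attempt. It must throw on failure for the retry to trigger.
        [Parameter(Mandatory)] [scriptblock]$Action,
        # Upper bound on attempts, so the deployment script always terminates.
        [int]$MaxAttempts = 5,
        # Seconds to wait after the first failure.
        [int]$InitialDelaySeconds = 30,
        # The wait is multiplied by this factor after each failed attempt.
        [int]$BackoffFactor = 2
    )
    $delay = $InitialDelaySeconds
    foreach ($attempt in 1..$MaxAttempts) {
        try {
            # On success, return the action's output and stop retrying.
            return (& $Action)
        } catch {
            # On the final attempt, rethrow so the caller sees the real error.
            if ($attempt -eq $MaxAttempts) { throw }
            Write-Warning "Attempt $attempt of $MaxAttempts failed: $($_.Exception.Message). Retrying in $delay s."
            Start-Sleep -Seconds $delay
            $delay *= $BackoffFactor
        }
    }
}
```

With the defaults this waits 30, 60, 120 and 240 seconds between the five attempts, so the worst case is bounded at roughly 7.5 minutes of waiting plus the time taken by the attempts themselves.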

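In the exception above, the message `Enable failed with exit code 52 Couldn't create marker file` and the timestamps `StartTime: 27/01/2023 14:30:16 EndTime: 27/01/2023 14:30:17` show the operation failing within about one second of starting, which suggests it isn't a case of not waiting long enough.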

edwardchalstrey1 commented 1 year ago

TODO:

jemrobinson commented 1 year ago

The GitHub issue that @JimMadge linked to suggested that one possible cause is that the VM is already attached to an (incorrect) Log Analytics workspace. Can you see whether this is true for any of the VMs which are erroring, @edwardchalstrey1?
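One way to check this could look like the sketch below. `Test-OmsWorkspaceMatch` is a hypothetical helper, not something in the repository. It assumes the extension's public settings are a JSON string containing a `workspaceId` key, and that the extension was registered on the VM under the name `OmsAgentForLinux`; both are worth confirming against a failing VM.

```powershell
# Hypothetical helper for illustration only; not part of the data-safe-haven codebase.
# Returns $true when the extension's public settings point at the expected workspace.
function Test-OmsWorkspaceMatch {
    param(
        # JSON string of the extension's public settings (assumed to hold "workspaceId").
        [Parameter(Mandatory)] [AllowEmptyString()] [string]$PublicSettingsJson,
        # ID of the Log Analytics workspace the SRE is supposed to report to.
        [Parameter(Mandatory)] [string]$ExpectedWorkspaceId
    )
    # No settings at all means we cannot confirm a match.
    if ([string]::IsNullOrWhiteSpace($PublicSettingsJson)) { return $false }
    $settings = $PublicSettingsJson | ConvertFrom-Json
    # A missing workspaceId compares as $null and therefore does not match.
    return ($settings.workspaceId -eq $ExpectedWorkspaceId)
}

# Usage sketch (requires Az.Compute and a logged-in session; variable names are placeholders):
# $ext = Get-AzVMExtension -ResourceGroupName $resourceGroupName -VMName $vmName -Name 'OmsAgentForLinux'
# Test-OmsWorkspaceMatch -PublicSettingsJson $ext.PublicSettings -ExpectedWorkspaceId $workspaceId
```

A `$false` result on a VM that fails with exit code 52 would support the theory that an agent is already installed and pointing at a different (for example, policy-assigned default) workspace.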

conradj3 commented 1 year ago

> No, I don't think so. It might just result in the script running for a long time or indefinitely.
>
> I'm not entirely convinced that more iterations of the loop will solve the problem. However, it would be worth testing out, with more iterations or a longer wait time between iterations.
>
> From the exception above Enable failed with exit code 52 Couldn't create marker file and StartTime: 27/01/2023 14:30:16 EndTime: 27/01/2023 14:30:17 suggest it isn't a case of not waiting long enough.

We have the same problem. We use the GitHub Virtual Runners for our self-hosted agent pools, and if you try to tie in the Azure OMS extension on any Ubuntu distro, you are presented with:

VM has reported a failure when processing extension 'OMSAgentForLinux'. Error message: "Enable failed with exit code 52 Couldn't create marker file"

From what I collected from the logs, it's a user permission issue: the installation account is unable to write to the location when the oms-script install kicks off. We gave up on having Azure Monitor / Insights on our scale sets and moved to another agent, away from the native Azure one.