Azure / azure-init

A minimal provisioning agent designed for Azure Linux VMs.
MIT License

[RFE] Report failures to Azure when there's an unrecoverable error #58

Open anhvoms opened 4 months ago

anhvoms commented 4 months ago

Current situation

Currently, azure-init does not report any failure to Azure. If it can't finish provisioning, it simply returns an error code. From the user's perspective, provisioning eventually fails with an OS provisioning timeout, because the Azure platform never receives a provisioning-complete signal.

In many cases the user cannot access the VM when provisioning fails, and so may have a very hard time figuring out why it failed.

Ideal future situation

Have azure-init report failures to Azure, which will then fail provisioning with a useful error message indicating why provisioning failed.

Implementation options

These two options are not mutually exclusive; they complement each other.

1) Use wireserver to report errors to the platform. Here is how cloud-init is doing it. Essentially, azure-init will need to construct a health report similar to the one used to report provisioning complete, but with a status of NotReady, a substatus of ProvisioningFailed, and a meaningful description that will eventually be shown to the user as an error message. I would strongly encourage azure-init to follow the error messages used by cloud-init, because we have post-processing, monitoring, and alerting mechanisms built around the errors returned by cloud-init. A sample error returned by cloud-init:

result=error|reason=http error querying IMDS|agent=Cloud-Init/23.3.3-0ubuntu0~20.04.1|http_code=410|duration=300.2051315307617|'exception=UrlError(''410 Client Error: Gone for url: http://169.254.169.254/metadata/instance?api-version=2021-08-01&extended=true'')'|url=http://169.254.169.254/metadata/instance?api-version=2021-08-01&extended=true|vm_id=e76f68ac-04a8-4069-be7c-7f04b01f520f|timestamp=2024-03-12T09:39:16.373226|documentation_url=https://aka.ms/linuxprovisioningerror

2) Failure reporting via wireserver only works if azure-init can reach wireserver and successfully post the error. When that is not possible, the other option is to write a KVP with the error, which the Azure platform will then process. See the cloud-init implementation for reference.
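
To make the two options concrete, here is a rough Rust sketch of how they could fit together. It is not azure-init's actual API: `FailureReport`, `HealthReport`, `Reporter`, and `report_failure` are hypothetical names. The pipe-delimited encoding (including quoting a field with `'` and doubling embedded quotes) is inferred from the sample error above rather than from a specification. The wireserver and KVP transports are left behind a trait, since their wire formats are not described in this issue.

```rust
/// Documentation link that appears at the end of the cloud-init sample error.
const DOCUMENTATION_URL: &str = "https://aka.ms/linuxprovisioningerror";

/// Hypothetical description of an unrecoverable provisioning failure.
pub struct FailureReport {
    pub reason: String,
    pub agent: String,
    pub vm_id: String,
    pub timestamp: String,
    /// Extra key/value context, e.g. http_code, url, exception.
    pub extra: Vec<(String, String)>,
}

/// Encode one `key=value` field. Assumption inferred from the sample:
/// a field containing `|`, `'`, or a line break is wrapped in single
/// quotes, with embedded single quotes doubled.
fn encode_field(key: &str, value: &str) -> String {
    let field = format!("{}={}", key, value);
    if field.contains(&['|', '\'', '\n', '\r'][..]) {
        format!("'{}'", field.replace('\'', "''"))
    } else {
        field
    }
}

impl FailureReport {
    /// Build the pipe-delimited description, using the field order of the sample.
    pub fn encode(&self) -> String {
        let mut fields = vec![
            encode_field("result", "error"),
            encode_field("reason", &self.reason),
            encode_field("agent", &self.agent),
        ];
        for (key, value) in &self.extra {
            fields.push(encode_field(key, value));
        }
        fields.push(encode_field("vm_id", &self.vm_id));
        fields.push(encode_field("timestamp", &self.timestamp));
        fields.push(encode_field("documentation_url", DOCUMENTATION_URL));
        fields.join("|")
    }
}

/// Health report contents as described in option 1.
pub struct HealthReport {
    pub state: &'static str,
    pub substatus: &'static str,
    pub description: String,
}

impl HealthReport {
    pub fn provisioning_failed(failure: &FailureReport) -> Self {
        HealthReport {
            state: "NotReady",
            substatus: "ProvisioningFailed",
            description: failure.encode(),
        }
    }
}

/// Hypothetical transport abstraction: one implementation would post to
/// wireserver, another would write a KVP entry.
pub trait Reporter {
    fn report(&mut self, health: &HealthReport) -> Result<(), String>;
}

/// Try wireserver first; if that fails, fall back to KVP (option 2).
pub fn report_failure(
    wireserver: &mut dyn Reporter,
    kvp: &mut dyn Reporter,
    health: &HealthReport,
) -> Result<(), String> {
    match wireserver.report(health) {
        Ok(()) => Ok(()),
        Err(ws_err) => kvp
            .report(health)
            .map_err(|kvp_err| format!("wireserver: {}; kvp: {}", ws_err, kvp_err)),
    }
}
```

A caller would build a `FailureReport` at the point where provisioning gives up, wrap it with `HealthReport::provisioning_failed`, and pass it to `report_failure` before exiting. Since the two options complement each other, sending the report through both channels unconditionally would be an equally reasonable policy; the fallback order here is only one possible choice.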