[RFE] azure-init should add retries around IMDS and Wireserver operations

anhvoms commented 4 months ago

Current situation

There's no retry when REST API calls to IMDS or Wireserver (goal_state, report_health) fail.

Impact

Without retries, a transient issue on the Azure platform causes provisioning to fail.

Additional information

When to retry, and how many times or for how long, is a complex topic, especially since IMDS/Wireserver does not provide any guidance. Below is the current behavior of cloud-init (ref, ref), which we can use as a reference (or perhaps we could expose this as a setting that can be configured within the image, e.g., /etc/azure-init/azure-init.conf).

- Total retry time should be no more than 5 minutes for IMDS and 20 minutes for Wireserver.
- Retry on connection timeout/read timeout: the timeout for the REST call should be set to 30s.
- Retry on non-200 HTTP status codes (410, 404, 503, 400, 500, 429): the timeout should be set to 2s, with a backoff of 1s.
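To make the numbers above concrete, here is a minimal sketch of such a policy in Rust, using only the standard library. All names (`RetryPolicy`, `Failure`, `retry`, the field names) are hypothetical and not part of azure-init or cloud-init. It also assumes one reading of the rules: the 2s timeout and 1s backoff apply to the attempt that follows a retryable status code, while a timed-out call is retried immediately with the 30s timeout.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// HTTP status codes listed in this issue as worth retrying.
const RETRYABLE_STATUS: [u16; 6] = [400, 404, 410, 429, 500, 503];

/// Hypothetical retry settings; the names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    /// Upper bound on the total time spent retrying.
    pub total_budget: Duration,
    /// Per-request timeout for the first attempt and after a timeout.
    pub connect_timeout: Duration,
    /// Per-request timeout for the attempt after a retryable status code.
    pub status_retry_timeout: Duration,
    /// Pause before retrying after a retryable status code.
    pub backoff: Duration,
}

impl RetryPolicy {
    /// Builds a policy with the 30s / 2s / 1s values from this issue.
    pub const fn with_budget(total_budget: Duration) -> Self {
        RetryPolicy {
            total_budget,
            connect_timeout: Duration::from_secs(30),
            status_retry_timeout: Duration::from_secs(2),
            backoff: Duration::from_secs(1),
        }
    }

    pub const IMDS: RetryPolicy = Self::with_budget(Duration::from_secs(5 * 60));
    pub const WIRESERVER: RetryPolicy = Self::with_budget(Duration::from_secs(20 * 60));
}

/// How a single attempt failed (an assumed simplification of real errors).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Failure {
    Timeout,
    Status(u16),
}

/// Calls `call` with a per-request timeout until it succeeds, fails with a
/// non-retryable status, or the total budget has elapsed.
pub fn retry<T>(
    policy: &RetryPolicy,
    mut call: impl FnMut(Duration) -> Result<T, Failure>,
) -> Result<T, Failure> {
    let start = Instant::now();
    let mut timeout = policy.connect_timeout;
    loop {
        let failure = match call(timeout) {
            Ok(value) => return Ok(value),
            Err(failure) => failure,
        };
        let retryable = match failure {
            Failure::Timeout => true,
            Failure::Status(code) => RETRYABLE_STATUS.contains(&code),
        };
        if !retryable || start.elapsed() >= policy.total_budget {
            return Err(failure);
        }
        match failure {
            // A timed-out call already waited; retry right away.
            Failure::Timeout => timeout = policy.connect_timeout,
            // After a retryable status, back off and use the short timeout.
            Failure::Status(_) => {
                timeout = policy.status_retry_timeout;
                sleep(policy.backoff);
            }
        }
    }
}
```

A caller would wrap each REST call in a closure that accepts the timeout, for example `retry(&RetryPolicy::IMDS, |timeout| fetch_metadata(timeout))`, where `fetch_metadata` is a placeholder for the real request. Keeping the policy in a plain struct would also make it straightforward to populate from a config file later.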

jeremycline commented 4 months ago

One thing we might want to consider is replacing the hand-written IMDS client with https://crates.io/crates/azure_svc_imds from https://github.com/Azure/azure-sdk-for-rust/ which, based on the documentation, already implements retries.

anhvoms commented 4 months ago

> One thing we might want to consider is replacing the hand-written IMDS client with https://crates.io/crates/azure_svc_imds from https://github.com/Azure/azure-sdk-for-rust/ which, based on the documentation, already implements retries.

Oh, this is nice. I wasn't aware of this crate. We should check it out and see whether the retry policy is easy to customize to our needs.
