Open anhvoms opened 4 months ago
One thing we might want to consider is replacing the hand-written IMDS client with https://crates.io/crates/azure_svc_imds from https://github.com/Azure/azure-sdk-for-rust/ which, based on the documentation, already implements retries.
Oh, this is nice. I wasn't aware of this crate. We should check it out and see whether the retry policy is easy to customize to our needs.
Current situation
There is no retry when REST API calls to IMDS or Wireserver (goal_state, report_health) fail.
Impact
Without retries, a transient issue on the Azure platform will cause provisioning to fail.
Additional information
When to retry, and how many times or for how long, is a complex topic, especially since IMDS/Wireserver does not provide any guidance. The following is the current behavior of cloud-init (ref, ref), which we can use as a reference (or perhaps we can expose this as a setting that can be configured within the image, e.g., /etc/azure-init/azure-init.conf):
Total retry time should not exceed 5 minutes for IMDS and 20 minutes for Wireserver.
Retry on connection timeout/read timeout: the timeout for the REST call should be set to 30s.
Retry on non-200 HTTP error codes (410, 404, 503, 400, 500, 429): the timeout should be set to 2s, with a backoff of 1s.
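To make the numbers above concrete, here is a minimal sketch of a budget-bounded retry loop using only the standard library. The names `Attempt`, `RetryPolicy`, `retry_with_budget`, `imds_policy`, and `wireserver_policy` are hypothetical and do not exist in azure-init today. The per-request timeouts (30s, 2s) are assumed to be set on whatever HTTP client performs the call and are not modeled here; the caller classifies each attempt as success, retryable, or fatal.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Outcome of a single attempt, as classified by the caller (hypothetical type).
#[derive(Debug)]
enum Attempt<T> {
    Success(T),
    /// Connection/read timeout, or one of the retryable status codes.
    Retryable(String),
    /// Anything else: give up immediately.
    Fatal(String),
}

/// Non-200 status codes listed in this issue as worth retrying.
const RETRYABLE_STATUS: [u16; 6] = [410, 404, 503, 400, 500, 429];

fn is_retryable_status(code: u16) -> bool {
    RETRYABLE_STATUS.contains(&code)
}

/// Hypothetical policy: a total time budget plus a fixed pause between attempts.
struct RetryPolicy {
    total_budget: Duration,
    backoff: Duration,
}

/// Assumed defaults taken from the cloud-init numbers quoted above.
fn imds_policy() -> RetryPolicy {
    RetryPolicy {
        total_budget: Duration::from_secs(5 * 60),
        backoff: Duration::from_secs(1),
    }
}

fn wireserver_policy() -> RetryPolicy {
    RetryPolicy {
        total_budget: Duration::from_secs(20 * 60),
        backoff: Duration::from_secs(1),
    }
}

/// Calls `op` until it succeeds, fails fatally, or the next pause would
/// push the elapsed time to or past the total budget.
fn retry_with_budget<T, F>(policy: &RetryPolicy, mut op: F) -> Result<T, String>
where
    F: FnMut() -> Attempt<T>,
{
    let start = Instant::now();
    loop {
        match op() {
            Attempt::Success(value) => return Ok(value),
            Attempt::Fatal(err) => return Err(err),
            Attempt::Retryable(err) => {
                if start.elapsed() + policy.backoff >= policy.total_budget {
                    return Err(format!("retry budget exhausted: {}", err));
                }
                sleep(policy.backoff);
            }
        }
    }
}
```

A caller would wrap each REST call in a closure that maps timeouts and the status codes above to `Attempt::Retryable`, then pass it to `retry_with_budget(&imds_policy(), ...)`. Keeping the budget and backoff in a plain struct would also make it straightforward to populate them from a config file later, if we go that way.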