Closed KasperMoss closed 6 months ago
Thank you for opening this issue, we will look into it.
@KasperMoss Thanks for your detailed description!
Do you think az network application-gateway wait --resource-group $ResourceGroup --name $AppGatewayName --custom <a-custom-JMESPath-query>
could be a workaround here?
add @appgwsuppgithub for awareness
Hi, @necusjz
I am not sure. We do not know the order in which the environments will reach this part of the process. It has to happen after our Bicep templates have finished deploying, and that step (plus a few others) introduces a lot of variance in the system. I am not sure which custom query we could use here that would be generic to the Application Gateway overall and different from the wait we are already using.
Sorry about the slow reply, I have been away for Easter.
Describe the bug
When updating an Application Gateway with Azure CLI, under specific timings, two calls might end up with one of them overwriting the other.
The problem is an interaction between Application Gateway and the default retry logic of the Azure CLI. When a service returns a retryable error, Azure CLI retries the exact same command once more. This does not work for Application Gateway, though, because it returns a 429 - Retryable error if two commands happen at the same time (or close to it). Because Application Gateway takes the entire state of the resource in its PUT command, retrying the same request will always delete the change that competed with it and caused the 429 in the first place. In other words, getting this retryable error and handling it this way reverts changes.
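The interleaving described above can be sketched with a toy model. This is only an illustration, not Azure CLI's actual retry code: FakeGateway, its busy flag, the pool names, the IP addresses and run() are all hypothetical. The refetch_on_retry=True branch shows one possible mitigation (re-reading the state before retrying), which is an assumption on my part and not a confirmed fix.

```python
import copy


class FakeGateway:
    """Hypothetical stand-in for a resource whose PUT replaces the whole state."""

    def __init__(self):
        self.state = {"pool-a": [], "pool-b": []}
        self.busy = False  # True while another update is being applied

    def get(self):
        return copy.deepcopy(self.state)

    def put(self, full_state):
        if self.busy:
            return 429  # retryable conflict, nothing is written
        self.state = copy.deepcopy(full_state)
        return 200


def run(refetch_on_retry):
    gw = FakeGateway()

    # Pipeline 1 reads the full state and prepares its change to pool-a.
    p1 = gw.get()
    p1["pool-a"] = ["10.0.0.1"]

    # Pipeline 2 reads the same state, changes pool-b, and its PUT is accepted.
    p2 = gw.get()
    p2["pool-b"] = ["10.0.0.2"]
    assert gw.put(p2) == 200

    # Pipeline 1's PUT arrives while pipeline 2's update is still being applied.
    gw.busy = True
    assert gw.put(p1) == 429
    gw.busy = False  # pipeline 2's update has finished

    if refetch_on_retry:
        # Assumed mitigation: rebuild the payload from fresh state.
        p1 = gw.get()
        p1["pool-a"] = ["10.0.0.1"]

    # Retry. Without the refetch this sends the stale payload again.
    assert gw.put(p1) == 200
    return gw.state


print(run(refetch_on_retry=False))  # pool-b change is lost
print(run(refetch_on_retry=True))   # both changes survive
```

In the blind-retry case the second PUT succeeds, so the caller sees no failure even though pool-b has been reset to its old value.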
Solution suggestions:
A major problem in the current solution is that Azure CLI completely hides the error from us. Without --debug enabled, we get no indication that there has been any trouble, and everything appears to pass correctly.
The below graph shows how the issue happened when two pipelines (running independently of each other) ended up with one removing the change the other had made. The graph should be viewed as a timeline, with items further down on the graph happening later.
Related command
az network application-gateway address-pool update
Errors
No error messages are received by default. With --debug on, the error is shown as a 429 - RetryableError response.
Issue script & Debug output
Command (note that this works fine outside of the timing issues)
$verificationAddressPool = az network application-gateway address-pool update --gateway-name $appGatewayName --name $verificationAddressPoolName --resource-group $AppGatewayResourceGroup --servers $privateEndpointIp
Adding a debug log to this would be cumbersome and nearly unreadable as it would include several debug logs that would need to be compared to each other to understand the issue. See the above graph instead.
Expected behavior
Two sequential updates that end with both changes applied.
Environment Summary
azure-cli 2.58.0
core 2.58.0
telemetry 1.1.0
Extensions:
azure-devops 1.0.0
Dependencies:
msal 1.26.0
azure-mgmt-resource 23.1.0b2
Python location '/opt/az/bin/python3'
Extensions directory '/opt/az/azcliextensions'
Python (Linux) 3.11.7 (main, Feb 29 2024, 02:08:19) [GCC 11.4.0]
Additional context
We have tried with
az network application-gateway wait --resource-group $ResourceGroup --name $AppGatewayName --updated
but this does not work, because no update is in progress until after both commands have finished their GET calls.
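A minimal sketch of why the wait does not serialize the two pipelines, assuming the wait returns as soon as the provisioning state is "Succeeded". ToyGateway and its methods are hypothetical and only model the provisioning state, not the real service.

```python
class ToyGateway:
    """Hypothetical model: only the provisioning state matters here."""

    def __init__(self):
        self.provisioning_state = "Succeeded"

    def wait_would_pass(self):
        # True when a wait on "Succeeded" would return straight away.
        return self.provisioning_state == "Succeeded"

    def get(self):
        # Reading the resource does not change its provisioning state.
        return self.provisioning_state

    def begin_put(self):
        self.provisioning_state = "Updating"


def interleave():
    gw = ToyGateway()
    events = []

    # Both pipelines run their wait before either has sent a PUT.
    events.append(("pipeline-1 wait passes", gw.wait_would_pass()))
    events.append(("pipeline-2 wait passes", gw.wait_would_pass()))

    # Both pipelines read the state; the gateway still reports "Succeeded".
    gw.get()
    gw.get()

    # Only the first PUT flips the state, which is too late to hold anyone back.
    gw.begin_put()
    events.append(("wait passes after first PUT", gw.wait_would_pass()))
    return events


for name, passed in interleave():
    print(name, passed)
```

Both waits pass because the check and the later PUT are separate steps, so the wait only helps a pipeline that arrives after another pipeline's PUT has already started.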