Azure / azure-cli

Azure Command-Line Interface
MIT License

Az Cli Gateway updates roll back each other #28615

Closed KasperMoss closed 6 months ago

KasperMoss commented 8 months ago

Describe the bug

When updating an Application Gateway with Azure CLI, under specific timings, two calls might end up with one of them overwriting the other.

The problem is an interaction between Application Gateway and the default retry logic of the Azure CLI. When a service returns a retryable error, Azure CLI retries the exact same command once more. This does not work for Application Gateway, though, because it returns a 429 - RetryableError response if two commands run at the same time (or close to it). Because Application Gateway takes the entire state of the resource in its PUT request, the retry resends a payload that was built before the competing change landed, so it always deletes the change that competed with it and caused the 429 in the first place. In other words, receiving this retryable error and handling it by retrying reverts the other change.
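To make the sequence concrete, here is a minimal toy model of the race in Python. It is only a sketch of the behavior described above, not Azure CLI's actual retry code: `ToyGateway`, `RetryableError`, `run_race`, the pool names and the addresses are all hypothetical names made up for illustration. It assumes the resource rejects a PUT with a retryable error while an earlier PUT is still being applied.

```python
class RetryableError(Exception):
    """Stands in for the 429 RetryableError response (hypothetical name)."""


class ToyGateway:
    """In-memory stand-in for a resource whose PUT replaces the whole state."""

    def __init__(self):
        self.pools = {}    # pool name -> list of server addresses
        self.busy = False  # True while an earlier PUT is still being applied

    def get(self):
        # Return a copy, like the body of a GET response.
        return {name: list(servers) for name, servers in self.pools.items()}

    def put(self, full_state):
        if self.busy:
            raise RetryableError("429: another update is in progress")
        # The PUT carries the entire resource, so it replaces everything.
        self.pools = {name: list(s) for name, s in full_state.items()}
        self.busy = True

    def finish_update(self):
        self.busy = False


def run_race(reread_before_retry):
    gw = ToyGateway()
    # Both pipelines read the same starting state before either one writes.
    state_a = gw.get()
    state_b = gw.get()
    state_a["pool-a"] = ["10.0.0.1"]
    state_b["pool-b"] = ["10.0.0.2"]

    gw.put(state_a)             # pipeline A wins the race
    try:
        gw.put(state_b)         # pipeline B collides with A's update
    except RetryableError:
        gw.finish_update()      # A's update completes before B retries
        if reread_before_retry:
            state_b = gw.get()  # rebuild B's payload from the current state
            state_b["pool-b"] = ["10.0.0.2"]
        gw.put(state_b)         # the retry
    gw.finish_update()
    return gw.get()


blind = run_race(reread_before_retry=False)
fresh = run_race(reread_before_retry=True)
print(sorted(blind))  # ['pool-b'] : pipeline A's change was overwritten
print(sorted(fresh))  # ['pool-a', 'pool-b'] : both changes survive
```

In this model the blind retry resends the stale payload and silently drops pipeline A's pool, while a retry that repeats the GET first keeps both changes.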

Solution suggestions:

A major problem with the current behavior is that Azure CLI completely hides the error from us: without --debug enabled, we get no indication that there has been any trouble, and everything appears to pass.

The below graph shows how the issue happened when two pipelines (running independently of each other) ended up with one removing the change the other had made. The graph should be viewed as a timeline, with items further down on the graph happening later.

[Image: Annotation 2024-03-22 125804]

Related command

az network application-gateway address-pool update

Errors

No error messages are received by default. With --debug on, the error is shown as a 429 - RetryableError response.

Issue script & Debug output

Command (note that this works fine outside of the timing issues):

$verificationAddressPool = az network application-gateway address-pool update --gateway-name $appGatewayName --name $verificationAddressPoolName --resource-group $AppGatewayResourceGroup --servers $privateEndpointIp

Adding a debug log to this would be cumbersome and nearly unreadable as it would include several debug logs that would need to be compared to each other to understand the issue. See the above graph instead.

Expected behavior

Two sequential updates that end with both changes applied.

Environment Summary

azure-cli 2.58.0

core 2.58.0
telemetry 1.1.0

Extensions:
azure-devops 1.0.0

Dependencies:
msal 1.26.0
azure-mgmt-resource 23.1.0b2

Python location '/opt/az/bin/python3'
Extensions directory '/opt/az/azcliextensions'

Python (Linux) 3.11.7 (main, Feb 29 2024, 02:08:19) [GCC 11.4.0]

Additional context

We have tried az network application-gateway wait --resource-group $ResourceGroup --name $AppGatewayName --updated, but this does not work because no update is in progress until after both commands have finished their GET call.

yonzhan commented 8 months ago

Thank you for opening this issue, we will look into it.

necusjz commented 8 months ago

@KasperMoss Thanks for your detailed description!

Do you think az network application-gateway wait --resource-group $ResourceGroup --name $AppGatewayName --custom <a-custom-JMESPath-query> could be a workaround here?

necusjz commented 8 months ago

add @appgwsuppgithub for awareness

KasperMoss commented 8 months ago

Hi, @necusjz

I am not sure. We do not know the order in which the environments will reach this part of the process: it has to happen after our Bicep templates have finished deploying, and that step (plus a few others) introduces a lot of variance in the system. I am not sure which custom query we could use here that would be generic to the Application Gateway overall and different from the wait we are already using.

Sorry about the slow reply - I have been away for Easter