Azure / azure-cli

Azure Command-Line Interface

Azure App Service fails to swap deployment slots due to http ping failures #29342

Open jchli opened 2 months ago

jchli commented 2 months ago

Describe the bug

We have a Windows App Service running .NET 6 in West US. When we deploy our application, it sometimes fails with an error during the "swap slot" operation:

Failed swapping site. Error: Cannot swap site slots for site 'my-site' because the 'blue' slot did not respond to http ping.

It is always the same error, and our blue slot always responds successfully to our pings when we hit it manually during these operations.

Further, this doesn't happen consistently, but it has started to crop up more often (~40-50% of our deploys across two totally separate directories) since 2024-06-28. We have an open support ticket with Azure support but haven't gotten any information yet, and I wanted to open this up to the community to see who else is hitting this problem, especially since the previous two GH issues were closed.

This is highly disruptive for our deployments: when the swap operation fails, it first hangs for 20+ minutes before erroring out. We have found that restarting the blue slot while the swap is hanging is one way to get things moving again, but we don't think that is an acceptable long-term solution.
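
A minimal sketch of automating that workaround in bash (the resource names are placeholders, and the 10-minute threshold is only a guess at when the swap counts as hung):

# start the swap in the background so we can watch it
az webapp deployment slot swap -g rg -n appname --slot blue --target-slot production &
swap_pid=$!

# give the swap 10 minutes before assuming it is hung
sleep 600

# if the swap is still running, restart the blue slot, which (per this
# thread) unsticks the hanging swap
if kill -0 "$swap_pid" 2>/dev/null; then
  az webapp restart -g rg -n appname --slot blue
fi

# propagate the swap's final exit code to the pipeline
wait "$swap_pid"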

Related command

Our deployment commands are approximately:

  1. az webapp deployment source config-zip --resource-group rg --name appname --src artifact.zip --slot blue
  2. az webapp deployment slot swap -g rg -n appname --slot blue --target-slot production
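
For reference (not part of our pipeline): the same command also supports a two-phase swap via --action, which first applies the target slot's settings to the source slot so it can be validated before traffic moves. A minimal sketch with the same placeholder names:

# apply production's settings to the blue slot for validation
az webapp deployment slot swap -g rg -n appname --slot blue --target-slot production --action preview
# complete the swap once the blue slot checks out
az webapp deployment slot swap -g rg -n appname --slot blue --target-slot production --action swap
# or roll the preview back instead:
az webapp deployment slot swap -g rg -n appname --slot blue --target-slot production --action reset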

Errors

Failed swapping site. Error: Cannot swap site slots for site 'my-site' because the 'blue' slot did not respond to http ping.

Issue script & Debug output

These commands are not failing. It is the swap slot operation in Azure that is failing.

Expected behavior

Swap should succeed

Environment Summary

azcliversion: 2.30.0

Additional context

yonzhan commented 2 months ago

Thank you for opening this issue, we will look into it.

kwrightepc commented 2 weeks ago

Restarting the slot during the swap totally worked in getting my pipeline to go green, which is: 1) amazing, I would not have thought to do that on my own, and 2) completely bizarre and doesn't make sense.

I ran into this issue a few months back and honestly can't remember how it got resolved....

ridler commented 1 week ago

My team is experiencing this exact same issue, with a slightly different deployment strategy that I will share in case it is helpful for Azure to isolate the root cause:

# 1 - deploy latest version of docker image in ACR to the Staging slot
az webapp config container set --name app-service-name --resource-group rg-name --docker-custom-image-name image-name-in-ACR --slot Staging
# 2 - restart the Staging slot
az webapp restart --name app-service-name --resource-group rg-name --slot Staging
# 3 - I wrote a custom NodeJs script that polls the staging slot for a successful response to an HTTP GET request (a rough bash equivalent is sketched just after this block). It:
#       - waits 10 seconds after the restart command was issued (to be sure that we're not getting false-positive responses from the previous deployment)
#       - sends an HTTP GET request every 5 seconds until a status code < 400 is received
# 4 - swap the Staging slot into the Production slot
az webapp deployment slot swap --name app-service-name --resource-group rg-name --slot Staging
# step 4 often (yet inconsistently) fails with the message: Cannot swap site slots for site 'app-service-name' because the 'Staging' slot did not respond to http ping.
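
The NodeJs script itself is not shown; a rough bash equivalent of the step-3 polling logic, assuming the slot answers at its default app-service-name-staging.azurewebsites.net hostname and treating any 2xx/3xx response as healthy, would be:

# wait out the restart so we don't get a false positive from the old deployment
sleep 10
while :; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://app-service-name-staging.azurewebsites.net/")
  # curl prints 000 on connection failure, so require a real 2xx/3xx
  [ "$code" -ge 200 ] && [ "$code" -lt 400 ] && break
  sleep 5
done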

Step 3 blocks the pipeline's progress until we have reason to believe the Staging slot is healthy. It is essentially an automated way of verifying the original post's claim:

our blue slot always responds successfully to our pings when we hit it manually during these operations... These commands are not failing. It is the swap slot operation in Azure that is failing

The issues described in the original post align with my team's experience. Specifically:

Further, this doesn't happen consistently, but started to crop up more often (~40-50% of our deploys across two totally separate directories) since 2024-06-28.

We don't have such precise timings and measurements, but this issue has come up significantly more often in the last few months (since around June/July 2024).

This is highly disruptive for our deployments, because when the swap operation hangs and eventually fails, it will hang for 20+ minutes. However, we have seen that restarting the blue slot while the swap is hanging is one way to get things running again, but we don't think that this is an acceptable long term solution.

We're having the exact same experience.

jchli commented 1 week ago

I have an update for you folks, as I've spent significant time discussing the issue with Azure Support. Disclaimer: this is just from my conversations with them, and I don't think it guarantees that you have the same root cause.

Quoting from my emails:

We confirmed this is happening because the instance(s) intermittently can't process the change notifications due to a caching issue. The official RCA for this was provided by the Product Group on another Support Request from about a month ago:

The App Service team has investigated your issues related to failing slot swaps. Engineers identified a race condition in the swap process that can cause delays in app setting propagation, which can then cause swap failure. While this is not new behavior, it can be hit more often with more workers in the ASP and/or any slowness in KeyVault secret resolution during swap. The cause of the race condition has been identified and a permanent fix will roll out in the next release. In the meantime, some platform behavior has been disabled to prevent your hitting this race condition. We do not expect your slot swap issues to re-occur.

As a mitigation, the Product Group disabled a platform behavior that should prevent you from encountering the problem on this App Service's stamp. In addition to that, it has also been identified the problem can be prevented by restarting the deployment slot, which you have been doing ever since.

Apparently, this will be fixed in ANT 103 (you can check the version via the Kudu console), but no word on when that will be released.

I do not recommend reaching out to Azure support to have them flip this hidden, special switch. I think that poses more risk, as now you're running on a "nonstandard" version of the PaaS, which, in my opinion, kind of defeats the purpose of a PaaS.