dotnet / arcade-services

Arcade Engineering Services
MIT License

Production rollouts cannot stop the service properly #4086

Open premun opened 4 days ago

premun commented 4 days ago

Somehow the service never goes from Stopping to Stopped:

https://dev.azure.com/dnceng/internal/_build/results?buildId=2566414&view=logs&j=d834f0ef-b202-5dd2-50f7-dc59af38ca7d&t=c5f81511-ed74-5842-0962-8d98850568fa&l=270

This has now happened 4 times in a row.

premun commented 3 days ago

This is the first build it started happening in: https://dev.azure.com/dnceng/internal/_build/results?buildId=2566376&view=results

dkurepa commented 3 days ago

Looking at this now. I wonder if we had some kind of bad commit that's screwing us here, and whether just forcing the rollout would help.

dkurepa commented 3 days ago

Okay, I think I see what's going on. The replicas we're starting after a deployment appear to be wrong. This means that the newly deployed replicas always have the status Stopped after the deployment. So when the deployment is happening, we set the status to Stopping, but a work item is never finished, so it's never set to Stopped. I think this bug was introduced in https://github.com/dotnet/arcade-services/pull/4072. Another point is that we shouldn't set the status to Stopping if we're already Stopped.
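To make the Stopping/Stopped point concrete, here is a minimal sketch of the transition as described in this comment. It is not the actual arcade-services code: `Status`, `ReplicaState`, `request_stop`, `on_work_item_finished`, and the `guard_stopped` flag are hypothetical names used only for illustration, and the sketch assumes that only a finishing work item moves a replica from Stopping to Stopped.

```python
from enum import Enum


class Status(Enum):
    WORKING = "Working"
    STOPPING = "Stopping"
    STOPPED = "Stopped"


class ReplicaState:
    """Hypothetical stand-in for the service status; not the real arcade-services type."""

    def __init__(self, status: Status, guard_stopped: bool) -> None:
        self.status = status
        self.guard_stopped = guard_stopped

    def request_stop(self) -> None:
        # With the guard, a service that is already Stopped stays Stopped.
        if self.guard_stopped and self.status is Status.STOPPED:
            return
        self.status = Status.STOPPING

    def on_work_item_finished(self) -> None:
        # Assumption: only a finishing work item moves Stopping to Stopped.
        if self.status is Status.STOPPING:
            self.status = Status.STOPPED


def stop_idle_replica(guard_stopped: bool) -> Status:
    # The service starts out Stopped and never processes a work item,
    # so on_work_item_finished() is never called.
    replica = ReplicaState(Status.STOPPED, guard_stopped)
    replica.request_stop()
    return replica.status


print(stop_idle_replica(guard_stopped=False).value)  # Stopping
print(stop_idle_replica(guard_stopped=True).value)   # Stopped
```

Without the guard, the status is left at Stopping with nothing to move it on, which matches the rollout waiting forever. With the guard, the stop request is a no-op for a service that is already Stopped.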

This is happening in staging too.

The question is how the scenario tests are passing. There must be some revision that has stayed on and is doing all the work.

dkurepa commented 3 days ago

The question is how the scenario tests are passing. There must be some revision that has stayed on and is doing all the work.

Yes, this appears to be the case. We start the same revision that we tried to stop earlier. So currently, when we deploy, we run the scenario tests on the previous revision.