Resource of type Deployment stuck in the Progressing state

lgob0 commented 1 year ago

Checklist:

[x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
[x] I've included steps to reproduce the bug.
[x] I've pasted the argocd version.

Describe the bug

Reconciliation of the Application resource gets stuck in the Progressing state, waiting for a Deployment resource to become ready even though it already is. The health message says:

Waiting for rollout to finish: observed deployment generation less than desired generation

At the same time, the Deployment resource is ready, and its metadata.generation and status.observedGeneration are equal. From our observations, this issue affects up to 11% of our daily deployments, and a single occurrence may take ArgoCD up to 5 minutes to notice that the Deployment resource is ready. We have observed the issue only on newly created Application resources, not on updated ones.
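For readers unfamiliar with the two fields, here is a minimal sketch of the kind of generation comparison that the health message describes. It is not Argo CD's actual health check: the `deploymentGenerations` type and the `generationMessage` function are hypothetical names used only for illustration, and only the message text is taken from this report.

```go
package main

import "fmt"

// deploymentGenerations is a hypothetical, trimmed-down view of a Deployment
// holding only the two fields discussed in this report.
type deploymentGenerations struct {
	Generation         int64 // metadata.generation
	ObservedGeneration int64 // status.observedGeneration
}

// generationMessage returns the health message quoted in this report when the
// observed generation lags behind the desired one, and "" otherwise.
func generationMessage(d deploymentGenerations) string {
	if d.ObservedGeneration < d.Generation {
		return "Waiting for rollout to finish: observed deployment generation less than desired generation"
	}
	return ""
}

func main() {
	lagging := deploymentGenerations{Generation: 3, ObservedGeneration: 2}
	caughtUp := deploymentGenerations{Generation: 3, ObservedGeneration: 3}

	fmt.Printf("%q\n", generationMessage(lagging))
	fmt.Printf("%q\n", generationMessage(caughtUp))
}
```

With equal generations, as in the attached Deployment, a check of this shape would return no message, which is why the message persisting on the Application looks like stale state rather than a real rollout delay.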

To Reproduce

There is no simple way to reliably reproduce a single occurrence of this issue. The behavior appears to be completely random. In our case, with ~300 deployments a day, there are always up to a few dozen affected.

Expected behavior

The Application resource becomes ready within a few seconds after every deployed resource is ready, including in the case described above.

Version

v2.5.5 from the helm chart version 5.16.14

Example

Resources captured with kubectl get during one incident:

application.yaml.txt deployment.yaml.txt

agaudreault commented 1 year ago

I will use this issue as the main one for my investigation.

I was able to reproduce the problem and saw the same behavior, mainly with Deployment resources. However, based on my findings, I believe it is caused by a sequence of events that is not specific to the "Deployment" kind and might affect other resources.

In the Deployment scenario, it seems to happen only when Pods are scaled down. To reproduce, I used kubectl to scale down the argocd-repo-server deployment.

The logs below seem to show the following sequence:

1. The Deployment resource is initially updated.
2. The application reconcile/refresh (level 1) is queued.
3. Processing of the reconcile operation starts for the application with resourceVersion 239567632.
4. During the reconcile, other updates to the Deployment resource are handled and the application reconcile/refresh (level 1) is queued again.
5. During the reconcile, the Application health changes from Healthy to Progressing (the Application is now at version 239568263).
6. The reconcile completes, and a new one starts.
7. PROBLEM -> The new reconcile uses ctrl.appInformer.GetIndexer().GetByKey(appKey.(string)) to get the Application, but it returns version 239567632. It should return 239568263.
8. Health is evaluated as Healthy, but since version 239567632 was already Healthy, the status is not updated.
9. Pods are updated and, since Pods are not a direct part of the app, a reconcile is triggered with level 0.
10. That reconcile starts, but it uses the cache and does not update health, only the resource tree.
11. There are no more updates to the resource. The app stays stuck in Progressing until a manual or external refresh with level 1 or higher is triggered.
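The core of the sequence above can be sketched as a small, self-contained simulation. This is not Argo CD code: the `app` type and the `reconcile` function are hypothetical, and the sketch assumes that the status is written only when the newly computed health differs from the health on the copy read from the cache, as the steps suggest.

```go
package main

import "fmt"

// app is a hypothetical, minimal stand-in for an Application object.
type app struct {
	ResourceVersion string
	Health          string
}

// reconcile mimics the comparison step described in the list: the computed
// health is written to the server copy only when it differs from the health
// on the copy read from the cache. It reports whether a write happened.
func reconcile(cached app, server *app, computedHealth string) bool {
	if cached.Health == computedHealth {
		// Looks unchanged relative to the (possibly stale) cached copy.
		return false
	}
	server.Health = computedHealth
	return true
}

func main() {
	// State on the API server after the first reconcile flipped the health.
	server := &app{ResourceVersion: "239568263", Health: "Progressing"}
	// Stale copy that the cache still returns to the second reconcile.
	stale := app{ResourceVersion: "239567632", Health: "Healthy"}

	// The Deployment is ready, so health is computed as Healthy, but the
	// stale copy already says Healthy and nothing is written.
	wrote := reconcile(stale, server, "Healthy")
	fmt.Println(wrote, server.Health) // false Progressing

	// With an up-to-date copy, the same reconcile would persist Healthy.
	fresh := *server
	wrote = reconcile(fresh, server, "Healthy")
	fmt.Println(wrote, server.Health) // true Healthy
}
```

In the sketch, the first call leaves the server copy in Progressing even though the computed health is Healthy, which matches the stuck state described in the last step.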

Logs

nabeelaccount commented 1 year ago

Hi, we are experiencing this issue. Can you share any updates on this please?

argoproj / argo-cd

Resource of type Deployment stuck in the Progressing state #14266