argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

"Waiting for healthy state"… #17314

Open roy-work opened 9 months ago

roy-work commented 9 months ago

Describe the bug

I've got an application (Mimir, a Grafana component) stuck in "waiting for healthy state of apps/StatefulSet/mimir-alertmanager and 8 more resources".

To Reproduce

Not entirely sure.

But I've deployed other applications in the past that similarly failed on their first deployment (… these things are just complicated…) and I've not yet encountered this behavior from ArgoCD, so I'm not sure what is unique about Mimir here.

But I'm deadlocked: Argo won't let me just attempt a new sync, as it thinks a "prior sync is already running". The pods won't fix themselves (the config is truly incorrect) … so deadlock?

Expected behavior

I can't even fathom what's going on here; this is to me a logical deadlock. If an application fails to deploy, we're almost certainly going to follow that broken deploy up with a configuration fix, and re-attempt stuff.

I don't need ArgoCD to get in my way, here, and I don't know why it is in my way.

Definitely not deadlocking me out of attempting deployments…

Screenshots

Version

v2.8.5+85025e1, from the UI.

Logs

I'm not sure Argo is emitting anything useful in the logs here.

argocd-application-controller-0 argocd-application-controller time="2024-02-26T22:52:01Z" level=info msg="Refreshing app status (controller refresh requested), level (0)" application=monitoring/mimir
argocd-application-controller-0 argocd-application-controller time="2024-02-26T22:52:01Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for healthy state of apps/StatefulSet/mimir-alertmanager and 8 more resources" application=monitoring/mimir
argocd-application-controller-0 argocd-application-controller time="2024-02-26T22:52:01Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: monitoring)" application=monitoring/mimir
argocd-application-controller-0 argocd-application-controller time="2024-02-26T22:52:01Z" level=info msg="getRepoObjs stats" application=monitoring/mimir build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=28 unmarshal_ms=27 version_ms=0
argocd-application-controller-0 argocd-application-controller time="2024-02-26T22:52:01Z" level=info msg="No status changes. Skipping patch" application=monitoring/mimir
argocd-application-controller-0 argocd-application-controller time="2024-02-26T22:52:01Z" level=info msg="Reconciliation completed" application=monitoring/mimir dest-name= dest-namespace=monitoring dest-server="https://kubernetes.default.svc" fields.level=0 time_ms=48
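
The one line above that does say something useful is the "Resuming in-progress operation" entry: it shows the controller picking the old sync back up on every refresh. As a rough sketch (not part of Argo CD; the function names `parse_logfmt` and `stuck_operation` are made up for illustration, and the message prefix is copied from the log line above rather than from any documented format), lines like that could be picked out of a controller log dump like this:

```python
import re

# Matches key=value and key="quoted value" pairs in a logfmt-style line.
_PAIR = re.compile(r'([\w.\-]+)=("(?:[^"\\]|\\.)*"|\S*)')

# Prefix as it appears in the controller log quoted in this issue (assumed stable).
_RESUME_PREFIX = "Resuming in-progress operation. phase: Running, message: "


def parse_logfmt(line):
    """Return the key/value fields found in one log line as a dict."""
    fields = {}
    for key, value in _PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def stuck_operation(line):
    """Return (application, message) if the line reports a resumed,
    still-running sync operation; otherwise return None."""
    fields = parse_logfmt(line)
    msg = fields.get("msg", "")
    if msg.startswith(_RESUME_PREFIX):
        return fields.get("application"), msg[len(_RESUME_PREFIX):]
    return None
```

Feeding each log line through `stuck_operation` and collecting the non-`None` results gives a list of applications whose previous sync is still being resumed.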

roy-work commented 9 months ago

Ugh! There's a "Terminate" button under "Sync Status", I did not realize.

… my gut was looking for something like that, but I thought it would be under the hamburger menu next to the Sync. (Which … that menu is also broken?)

Anyways, terminating it caused it to then attempt a new sync, and apply the fix. (My deploy is still broken, but that's another matter, and I at least have a new reason for that…)
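
The same Terminate action is also available from the CLI as `argocd app terminate-op`. A minimal sketch of scripting it follows, assuming the `argocd` CLI is installed and logged in; the helper names and the app name `mimir` are illustrative, not taken from this thread's setup. The explicit `argocd app sync` step is optional, since in my case a new sync started on its own after the terminate.

```python
import subprocess


def terminate_commands(app_name, resync=False):
    """Build the argocd CLI invocations that terminate the running
    operation and, optionally, start a fresh sync afterwards."""
    commands = [["argocd", "app", "terminate-op", app_name]]
    if resync:
        commands.append(["argocd", "app", "sync", app_name])
    return commands


def run_commands(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.
    `runner` is injectable so the flow can be exercised without a cluster."""
    for cmd in commands:
        runner(cmd, check=True)


# Example (hypothetical app name; requires a logged-in argocd CLI):
# run_commands(terminate_commands("mimir", resync=True))
```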

I'm still perplexed as to why now, of all times, it chooses to block me?

truepele commented 4 months ago

It affects me too. My deployment model assumes minimal intervention on the Argo CD side. When I have a poisoned commit with a faulty deployment definition, the Argo CD sync gets stuck. A new commit that fixes the deployment definition has no chance to fix the issue, because Argo CD waits for the previous sync to complete (waiting for healthy state of the sts). We are speaking GitOps here, so how do I get out of the deadlock by pushing a fix to Git?

truepele commented 3 months ago

@alexmt is there any workaround to the problem?

Can I configure a timeout for the sync operation? If the deployments do not become healthy within the configured time, terminate the sync and let the next scheduled sync pick up and apply the latest commit, which fixes the problem. Is such a timeout configuration available?

This is a real deal breaker for me... please, any help will be greatly appreciated!
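
To make the timeout idea concrete, here is a rough sketch of the check an external watchdog could run. This is not an Argo CD feature; it is only an illustration. It assumes the Application has been fetched as JSON (for example with `argocd app get <app> -o json`) and relies on the `status.operationState.phase` and `status.operationState.startedAt` fields; the function name and the timestamp format without fractional seconds are assumptions.

```python
from datetime import datetime, timedelta, timezone


def operation_exceeds_timeout(app, now, timeout):
    """Return True if the Application's current operation is still
    Running and started more than `timeout` before `now`."""
    state = app.get("status", {}).get("operationState") or {}
    if state.get("phase") != "Running":
        return False
    started = state.get("startedAt")
    if not started:
        return False
    # Assumes an RFC 3339 UTC timestamp such as 2024-02-26T22:52:01Z.
    started_at = datetime.strptime(started, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return now - started_at > timeout
```

When the check returns True, the watchdog would terminate the operation (for example with `argocd app terminate-op`) so that the next sync can apply the fixed commit.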

piljaechae commented 2 months ago

Also having the same issue here.

blixem777 commented 1 month ago

same for me

andrii-korotkov-verkada commented 1 week ago

What are your argocd versions?