Workflow / event pods block scale down

elizabethking2 commented 2 years ago

Checklist

[X] Double-checked my configuration.
[x] Tested using the latest version.
[X] Used the Emissary executor.

Versions

Argo workflows version: v3.3.6
Argo events version: 1.12.0
K8s GKE version: v1.21.1

Summary

After a large event kicked off hundreds of workflows several weeks ago, our prod cluster has not been able to scale back down. Both workflow and event pods block the cluster scale-down in GKE, with the warnings "Pod is blocking scale down because it has local storage" and "Pod is blocking scale down because it's not backed by a controller". [Screenshot 2022-08-31 at 11 28 11]

What is the best practice here? Both of these warnings suggest adding a safe-to-evict annotation. Is this safe to add?
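
For reference, a minimal sketch of what adding the annotation to workflow pods could look like. It assumes the annotation key is the Cluster Autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict` and that it is applied through the Workflow's `spec.podMetadata` field. The helper name `with_safe_to_evict` and the example manifest are hypothetical. Whether eviction is actually safe still depends on what the workflow tasks do.

```python
import copy
import json

# Annotation key read by the Kubernetes Cluster Autoscaler (assumed to be the
# annotation the GKE warnings refer to).
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def with_safe_to_evict(workflow: dict) -> dict:
    """Return a copy of a Workflow manifest whose pods carry the annotation."""
    patched = copy.deepcopy(workflow)  # leave the caller's manifest untouched
    annotations = (
        patched.setdefault("spec", {})
        .setdefault("podMetadata", {})
        .setdefault("annotations", {})
    )
    annotations[SAFE_TO_EVICT] = "true"  # annotation values must be strings
    return patched


# Hypothetical minimal Workflow, used only for illustration.
example = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "example-"},
    "spec": {"entrypoint": "main"},
}

print(json.dumps(with_safe_to_evict(example)["spec"]["podMetadata"], indent=2))
```

This only marks the pods as evictable for the autoscaler. It does not make a task resumable, so an evicted pod would still fail unless the workflow is set up to retry it.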

Worth noting that both CPU and memory utilisation are low. [Screenshot 2022-08-31 at 11 32 15]

Additionally, we've implemented pod disruption budgets to reduce the chance of voluntary disruption of workflow pods. In the meantime, we are investigating internally whether this could be one factor blocking the scale-down after a surge.

Diagnostics

This can be reproduced by kicking off 100+ workflows that each sleep for 1000+ seconds.
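
A rough sketch of how such a batch of manifests could be generated. The image `alpine:3`, the template name, and the helper `sleep_workflow` are assumptions for illustration, not what was actually run on our cluster.

```python
import json


def sleep_workflow(seconds: int = 1000) -> dict:
    """Build a single-step Workflow manifest that just sleeps."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName lets the API server pick a unique name per submission.
        "metadata": {"generateName": "sleep-"},
        "spec": {
            "entrypoint": "sleep",
            "templates": [
                {
                    "name": "sleep",
                    "container": {
                        "image": "alpine:3",  # assumed image; any with `sleep` works
                        "command": ["sleep", str(seconds)],
                    },
                }
            ],
        },
    }


# 100 identical manifests, one per workflow to submit.
manifests = [sleep_workflow(1000) for _ in range(100)]
print(json.dumps(manifests[0], indent=2))
```

Each manifest can be written to a file and submitted with `kubectl create -f` (`create` rather than `apply`, since the manifests use `generateName`).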

We see ~13k log entries per hour for the local storage scale-down issue. [Screenshot 2022-08-31 at 11 23 45]

And ~100 log entries per hour for the no-controller scale-down issue. [Screenshot 2022-08-31 at 11 58 41]

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

sarabala1979 commented 2 years ago

This looks more like a GKE issue.

alexec commented 2 years ago

+1 to @sarabala1979. Does not seem like an Argo issue. Please re-open if you have more info.

dis-sid commented 3 days ago

Hello, no one answered the question though. GKE needs pods to be managed by a controller (Deployment, StatefulSet, etc.) or else to carry the safe-to-evict annotation before it can scale down the number of nodes in the cluster. Is this Argo "orchestrator" pod safe to evict? (I don't know the implementation details, but if it retries until success without side effects, I'd consider it "safe" enough to be evicted on scale-downs.)

agilgur5 commented 2 days ago

> Is this argo "orchestrator" pod safe to evict ?

The Controller is designed to be resilient to restarts, as it stores all state in its managed CRs. Intermediate state, such as Pod changes, may be missed during downtime, however. See also the "High Availability" documentation.

> needs pods to be managed by something deployment, statefulset etc

The Controller is also backed by a Deployment currently.

This issue is asking about individual Workflow Pods though, and for those the answer depends entirely on how you designed your tasks -- Argo cannot answer that for you as it is in user-land.
