argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

[Feat] Add pod scheduling to timeout, or cancel stuck jobs after new timeout #11025

Open · ADustyOldMuffin opened 1 year ago

ADustyOldMuffin commented 1 year ago

Summary

When scheduling jobs/templates in a workflow, I'd like to pin them to specific nodes. These jobs are not critical, so if they can't be scheduled I'd like to just skip them. The issue is that the currently available timeouts don't count time spent pending/unschedulable, so if the pods can't be scheduled they hang up the entire workflow (see the sketch below).
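For reference, a minimal sketch of the kind of template described here (the node label is a stand-in); `activeDeadlineSeconds` and `timeout` are existing template fields, but per this issue neither of them unsticks a pod that never leaves Pending:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: optional-step-
spec:
  entrypoint: main
  templates:
    - name: main
      # Pin the step to specific nodes; if nothing matches, the pod
      # sits in Pending indefinitely.
      nodeSelector:
        example.com/special-node: "true"   # stand-in label
      # Existing knobs; per this issue they do not count time spent
      # Pending/unschedulable, so they never fire for a stuck pod.
      activeDeadlineSeconds: 300
      timeout: 10m
      container:
        image: alpine:3.19
        command: [sh, -c, "echo non-critical work"]
```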

Use Cases

When a workflow has steps that might not be schedulable in Kubernetes, I'd like a way to time them out or stop them after they have sat in a pending state for too long.
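One possible shape for this, sketched purely as an illustration: a template-level field that fails or skips the node once its pod has been Pending for too long. The field name `pendingTimeout` is hypothetical and does not exist in Argo Workflows today:

```yaml
    - name: optional-step
      nodeSelector:
        example.com/special-node: "true"   # stand-in label
      # HYPOTHETICAL field illustrating this proposal; it is not part
      # of the Argo Workflows spec today.
      pendingTimeout: 5m
      container:
        image: alpine:3.19
        command: [sh, -c, "echo optional work"]
```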


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

Gerrit-K commented 1 year ago

I think this request would fix a similar problem I've encountered a few times. We have CLI scripts that submit workflows, and one of the user-controlled inputs is the container image. If the input is wrong, it can cause an "invalid reference format" error on the pod, which is then effectively unschedulable. Of course, we can (and did) add checks on the client side to prevent this, but it would also be nice if Argo were able to detect these cases and recover from them (i.e. fail automatically).
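A minimal sketch of the pattern described above, assuming a template whose image comes straight from a user-supplied parameter; a malformed value (e.g. `alpine::bad`) is rejected by the container runtime with "invalid reference format" and the pod never starts:

```yaml
    - name: user-image-step
      inputs:
        parameters:
          - name: image   # user-controlled via the CLI wrapper
      container:
        # A malformed value here (e.g. "alpine::bad") triggers the
        # "invalid reference format" error described above.
        image: "{{inputs.parameters.image}}"
        command: [sh, -c, "echo running user image"]
```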

ElQDuck commented 8 months ago

I have a similar problem with pods in an infinite pending state because, e.g., a referenced PVC can't be found. None of the timeouts (activeDeadlineSeconds, timeout) work. I can see a message in Argo describing the problem, but I can't define a timeout for such a case:

    Unschedulable: 0/1 nodes are available: persistentvolumeclaim "non-existent" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
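A minimal sketch that reproduces this state, using the PVC name from the error message above:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: missing-pvc-
spec:
  entrypoint: main
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: non-existent   # PVC does not exist, so the pod is Unschedulable
  templates:
    - name: main
      container:
        image: alpine:3.19
        volumeMounts:
          - name: data
            mountPath: /data
        command: [sh, -c, "ls /data"]
```
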
tooptoop4 commented 5 months ago

I have a similar issue where new nodes go through these states:

1. kubelet ready
2. kubelet network not-ready
3. back to kubelet ready

Pods scheduled at step 1 become stuck in Pending. It would be great if Argo had a pending timeout that, when met, would allow retrying with a new pod (see the sketch below).
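A hedged note on why the existing retry mechanism doesn't cover this: `retryStrategy` is a real Argo Workflows field, but it only fires after the node fails, and a pod stuck in Pending never fails. That gap is what a pending timeout would close. A sketch:

```yaml
    - name: flaky-node-step
      # retryStrategy is an existing field, but it only kicks in once
      # the node fails; a pod that never leaves Pending never fails,
      # so on its own this cannot replace a stuck pod.
      retryStrategy:
        limit: "3"
        retryPolicy: Always
      container:
        image: alpine:3.19
        command: [sh, -c, "echo work"]
```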