argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Improved requeue behaviour with back-off #12381

Open mrob95 opened 11 months ago

mrob95 commented 11 months ago

Summary

We have quite a few workflows with many short steps (often <1s, using small images). These seem to run slower than they should.

From reading previous issues (particularly https://github.com/argoproj/argo-workflows/issues/6262 and https://github.com/argoproj/argo-workflows/pull/9729) and the docs/code (https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/rate_limiters.go#L17), it seems that the DEFAULT_REQUEUE_TIME config variable controls how often Argo checks whether steps have completed. So if this is set to the default of 10s, a pod that completes in 1s will take an additional 9s to be registered as completed by Argo before the next step can begin. The trade-off is that lower values cause more load on the k8s API.

Rather than just a simple "check every X seconds" policy, would it make sense to have a rate limiter that checks frequently when pods are young and backs off as they run longer? This would mean that workflows with lots of short steps would complete snappily while avoiding spamming the API for longer steps.

An example policy could be something like "requeue every elapsed_time/5 seconds, up to a maximum of 30s". This would guarantee that <20% of the runtime of a step is spent waiting for Argo to register that it has finished.
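
Sketching that policy against the standard client-go workqueue.RateLimiter interface might look something like the following (the names, the minimum delay, and the 30s cap are illustrative, not actual Argo code):

package ratelimiters

import (
    "sync"
    "time"

    "k8s.io/client-go/util/workqueue"
)

// elapsedTimeRateLimiter requeues an item after elapsed/5, capped at maxDelay,
// where elapsed is the time since the item was first enqueued. Young items are
// re-checked quickly; long-running items back off towards maxDelay.
type elapsedTimeRateLimiter struct {
    mu        sync.Mutex
    firstSeen map[interface{}]time.Time
    maxDelay  time.Duration
}

func NewElapsedTimeRateLimiter(maxDelay time.Duration) workqueue.RateLimiter {
    return &elapsedTimeRateLimiter{
        firstSeen: make(map[interface{}]time.Time),
        maxDelay:  maxDelay,
    }
}

func (r *elapsedTimeRateLimiter) When(item interface{}) time.Duration {
    r.mu.Lock()
    defer r.mu.Unlock()
    start, ok := r.firstSeen[item]
    if !ok {
        // First requeue: remember when tracking started and come back quickly.
        r.firstSeen[item] = time.Now()
        return time.Second
    }
    delay := time.Since(start) / 5
    if delay > r.maxDelay {
        delay = r.maxDelay
    }
    return delay
}

// Forget drops the tracked start time once the item has been fully processed.
func (r *elapsedTimeRateLimiter) Forget(item interface{}) {
    r.mu.Lock()
    defer r.mu.Unlock()
    delete(r.firstSeen, item)
}

func (r *elapsedTimeRateLimiter) NumRequeues(item interface{}) int {
    return 0
}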

If there is agreement on what a better policy would look like then I'd be happy to submit a PR to implement it.

Thanks!


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.


jswxstw commented 10 months ago

I think you misunderstood the purpose of the DEFAULT_REQUEUE_TIME configuration. The Workflow Controller relies on the PodInformer to determine whether a step has completed, rather than detecting it periodically. DEFAULT_REQUEUE_TIME is likely intended to slow down the scheduling of workflows in scenarios where the cluster contains a large number of workflows with short steps, in order to reduce the load on the API Server and Workflow Controller.

mrob95 commented 10 months ago

I see, thanks for the clarification. So this means that if 100 steps for a workflow complete within the same 10s window, the workflow only gets queued once, 10s after the completion event was received for the first step.

https://github.com/kubernetes/client-go/blob/master/util/workqueue/delaying_queue.go#L310-L313

If I'm understanding correctly, this still seems like it will often slow down workflows unnecessarily. In an extreme case, a workflow with 10 sequentially dependent steps that each take 1s to run will wait 10s to requeue after the completion of each step, giving a total runtime of roughly 100s rather than 10s.

For my use case I think it makes sense to drop the default requeue time to 1s, since we are not running many workflows in parallel and API load is not a huge concern.

jswxstw commented 10 months ago

If I'm understanding correctly, this still seems like it will often slow down workflows unnecessarily. In an extreme case, a workflow with 10 sequentially dependent steps that each take 1s to run will wait 10s to requeue after the completion of each step, giving a total runtime of roughly 100s rather than 10s.

For my use case I think it makes sense to drop the default requeue time to 1s, since we are not running many workflows in parallel and API load is not a huge concern.

It is a good choice to reduce the default requeue time in your scenario, because with the default configuration each step will effectively take at least 10 seconds.

jswxstw commented 10 months ago

Actually, I have also faced the challenge of accelerating workflow execution. I'm curious how you achieved each step completing in less than 1 second. When I use a Container Template, it takes around 6 seconds just to start the pod.

mrob95 commented 10 months ago

The 1s figure was based on intuition about the size of the image (a few megabytes for Alpine) and the amount of work the steps are doing, rather than any rigorous profiling. I just did a quick test using kubectl run and it looks like the lower bound for pod execution is about 3-4s on my cluster.

$ time kubectl run alpine-test --image=alpine --restart=Never --rm -it --command -- sh -c 'echo hello'
hello
pod "alpine-test" deleted
kubectl run alpine-test --image=alpine --restart=Never --rm -it --command --   0.21s user 0.07s system 7% cpu 3.512 total

Did you find any other ways of improving the execution time of workflows other than DEFAULT_REQUEUE_TIME?

jswxstw commented 10 months ago

Did you find any other ways of improving the execution time of workflows other than DEFAULT_REQUEUE_TIME?

Using an HTTP/Plugin Template instead of a Container Template is a feasible solution. Steps using HTTP/Plugin Templates all execute in the same agent pod, which saves a lot of pod creation time. By the way, DEFAULT_REQUEUE_TIME is still extremely important because ARGO_AGENT_PATCH_RATE also defaults to the value of DEFAULT_REQUEUE_TIME.

carolkao commented 4 months ago

I think you misunderstood the purpose of the DEFAULT_REQUEUE_TIME configuration. The Workflow Controller relies on the PodInformer to determine whether a step has completed, rather than detecting it periodically. DEFAULT_REQUEUE_TIME is likely intended to slow down the scheduling of workflows in scenarios where the cluster contains a large number of workflows with short steps, in order to reduce the load on the API Server and Workflow Controller.

Hi @jswxstw,

I have a similar question. I have a workflow with only one step. The step took about 20s to complete, but the duration of the workflow is 30s, a 10-second delay. If I reduce DEFAULT_REQUEUE_TIME to 5 seconds, the workflow duration becomes 25s. As you mentioned, "the Workflow Controller relies on the PodInformer to determine whether a step has completed", but it looks like the workflow controller did not reconcile the corresponding workflow when the pod completed, according to the logs below (with a 5s DEFAULT_REQUEUE_TIME):

time="2024-06-26T01:27:12.148Z" level=info msg="Created pod: carol-ws.exec-script-exe-main-s8q6w[1].execute-script (carol-ws.exec-script-exe-main-s8q6w-1061060677)" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:12.148Z" level=info msg="Workflow step group node carol-ws.exec-script-exe-main-s8q6w-2536539054 not yet completed" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:12.148Z" level=info msg="TaskSet Reconciliation" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:12.149Z" level=info msg=reconcileAgentPod namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:12.163Z" level=info msg="Workflow update successful" namespace=canal-flow phase=Running resourceVersion=569838 workflow=carol-ws.exec-script-exe-main-s8q6w

// --> 5s later (DEFAULT_REQUEUE_TIME)
time="2024-06-26T01:27:17.152Z" level=info msg="Processing workflow" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:17.153Z" level=info msg="Task-result reconciliation" namespace=canal-flow numObjs=0 workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:17.153Z" level=info msg="node changed" namespace=canal-flow new.message= new.phase=Running new.progress=0/1 nodeID=carol-ws.exec-script-exe-main-s8q6w-1061060677 old.message= old.phase=Pending old.progress=0/1 workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:17.154Z" level=info msg="SG Outbound nodes of carol-ws.exec-script-exe-main-s8q6w-2414852609 are [carol-ws.exec-script-exe-main-s8q6w-2414852609]" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:17.155Z" level=info msg="Workflow step group node carol-ws.exec-script-exe-main-s8q6w-2536539054 not yet completed" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:17.155Z" level=info msg="TaskSet Reconciliation" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:17.155Z" level=info msg=reconcileAgentPod namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:17.166Z" level=info msg="Workflow update successful" namespace=canal-flow phase=Running resourceVersion=569858 workflow=carol-ws.exec-script-exe-main-s8q6w

// during this period, the pod completed at 01:27:34
// but there is no reconcile log until about 5 seconds later

time="2024-06-26T01:27:39.294Z" level=info msg="Processing workflow" namespace=canal-flow workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:39.294Z" level=info msg="Task-result reconciliation" namespace=canal-flow numObjs=1 workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:39.294Z" level=info msg="task-result changed" namespace=canal-flow nodeID=carol-ws.exec-script-exe-main-s8q6w-1061060677 workflow=carol-ws.exec-script-exe-main-s8q6w
time="2024-06-26T01:27:39.294Z" level=info msg="node changed" namespace=canal-flow new.message= new.phase=Succeeded new.progress=0/1 nodeID=carol-ws.exec-script-exe-main-s8q6w-1061060677 old.message= old.phase=Running old.progress=0/1 workflow=carol-ws.exec-script-exe-main-s8q6w

The version I'm using is v3.4.8. Did I misunderstand anything? What should I do if I want to reduce this 10-second gap?

jswxstw commented 4 months ago

Did I misunderstand anything? What should I do if I want to reduce this 10-second gap?

If the workflow or its pod changes, the informer receives the update event and puts the workflow into wfQueue using wfc.wfQueue.AddRateLimited. If DEFAULT_REQUEUE_TIME is configured to 5s, then the same workflow can only be put into the queue every 5s. Additionally, there will be an interval of time, which is not controllable, before the informer actually observes the update of the pod status.
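
A rough, self-contained sketch of that throttling behaviour, using client-go's workqueue directly with a fixed per-item delay standing in for DEFAULT_REQUEUE_TIME (illustrative only, not the actual controller code):

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

// fixedDelayLimiter delays every requeue by the same interval.
type fixedDelayLimiter struct{ interval time.Duration }

func (r *fixedDelayLimiter) When(item interface{}) time.Duration { return r.interval }
func (r *fixedDelayLimiter) Forget(item interface{})             {}
func (r *fixedDelayLimiter) NumRequeues(item interface{}) int    { return 0 }

func main() {
    queue := workqueue.NewRateLimitingQueue(&fixedDelayLimiter{interval: 5 * time.Second})
    defer queue.ShutDown()

    // Two pod-update events for the same workflow arriving close together:
    // both are added with AddRateLimited, but the item only becomes available
    // once, after the configured delay has elapsed (the delaying queue keeps
    // the earlier ready time for duplicate adds).
    queue.AddRateLimited("argo/my-workflow")
    queue.AddRateLimited("argo/my-workflow")

    fmt.Println("available before delay:", queue.Len()) // 0
    time.Sleep(6 * time.Second)
    fmt.Println("available after delay:", queue.Len()) // 1
}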

carolkao commented 4 months ago

Hi @jswxstw, thanks for your prompt reply. I'd like to confirm: once the workflow is put into the queue, will it be reconciled immediately (assuming there is only one workflow in the environment)? Or is there any possible latency between the workflow being added to the queue and being reconciled?

jswxstw commented 4 months ago

wfWorkers are responsible for processing the workflows in wfQueue; a workflow will be processed immediately if it is the only one in the queue.
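
Roughly, the worker side behaves like this sketch (illustrative, not the actual Argo controller code): Get blocks until an item becomes available, so an idle worker picks up a lone workflow as soon as its requeue delay has elapsed.

package main

import (
    "k8s.io/client-go/util/workqueue"
)

// runWorker drains the workflow queue in a loop. Get blocks until an item's
// requeue delay has elapsed and the item is handed out, so with a single
// workflow and an idle worker the reconcile function runs immediately after
// the item becomes available.
func runWorker(queue workqueue.RateLimitingInterface, reconcile func(key string)) {
    for {
        item, shutdown := queue.Get()
        if shutdown {
            return
        }
        reconcile(item.(string))
        queue.Done(item)
    }
}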