argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Unable to accelerate cleanup of succeeded workflows with TTL workers #12206

Open n1klasD opened 10 months ago

n1klasD commented 10 months ago

Pre-requisites

What happened/what you expected to happen?

We are experiencing difficulties in increasing the number of TTL (time-to-live) workers to clean up succeeded workflows faster. We have set the --workflow-ttl-workers flag of the controller to higher values, such as 64 or 96, but it appears that the workflow controller does not recognize or utilize these values effectively. Despite configuring a higher number of TTL workers, the Prometheus metric "argo_workflows_workers_busy_count" consistently shows only one busy TTL worker, not the expected 64 or 96.

This issue is causing a build-up of succeeded workflows over time, currently peaking at around 10,000, because we are creating workflows faster than we can clean them up. Our intention was to leverage multiple TTL workers to accelerate the clean-up process and reduce the backlog of succeeded workflows. It's worth noting that the controller only uses a fraction of the machine's CPU (approximately 5% of a 16-core machine).

Additionally, we are observing frequent logs indicating that workflows are being cleaned up late, as seen below.

Version

3.4.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

-

Logs from the workflow controller

time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL"

Logs from your workflow's wait container

-
sgutwein commented 10 months ago

Here are some further screenshots from our monitoring system:

You can see that the number of busy TTL workers never surpasses one, and often no worker is busy at all. We expect this metric (argo_workflows_workers_busy_count) to approach the number set by --workflow-ttl-workers when there are many workflows to be cleaned up:

[screenshot: argo_workflows_workers_busy_count for the workflow_ttl_queue]

Here you can see the number of succeeded workflows, which are not cleaned up fast enough:

[screenshot: count of succeeded workflows over time]

Strangely, the metric for the depth of the workflows queue is always at zero:

[screenshot: workflow_ttl_queue depth, flat at zero]

But we see a high frequency of the logs time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL":

[screenshot: frequency of the "Queueing Succeeded workflow ... for delete ... due to TTL" logs]

n1klasD commented 10 months ago

I'd like to add a point to that. It seems like the workflow garbage collection sometimes just stops for an hour or so, until I delete the workflow-controller pod. Then it works for a while, cleaning up a lot of succeeded workflows, before stopping again. We currently use this as a workaround when too many succeeded workflows have accumulated.

Apart from setting --workflow-ttl-workers to 64 or 128, we also tried setting the environment variable WORKFLOW_GC_PERIOD in the controller from the default value of 5m to 2m, to increase the periodicity for GC of workflows. We did not see any improvements so far though.

I read through some other issues, and it seems to be related to this (or at least the comment under this issue): https://github.com/argoproj/argo-workflows/issues/4634#issuecomment-747068867

agilgur5 commented 10 months ago

We have set the --workflow-ttl-workers flag of the controller to higher values, such as 64 or 96, but it appears that the workflow controller does not recognize or utilize these values effectively

So this corresponds directly to the number of goroutines launched for the GC controller -- quite literally this for loop.

Whether it utilizes it effectively, can't quite say on that one. Sounds like it may be a no based on your report. There is a queue, which could limit concurrency to an extent.
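
For illustration, here is a minimal sketch of the client-go fan-out pattern such a GC loop typically follows (not the actual Argo source; the worker count, queue contents, and names are stand-ins): each configured worker is one goroutine, all draining a single shared workqueue, and wait.Until re-invokes the worker function on a fixed one-second period.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// processNextItem pops one key off the shared queue and "deletes" it.
// It returns false only once the queue has been shut down.
func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)
	fmt.Println("deleting expired workflow", key) // stand-in for the real API delete
	return true
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// The fan-out: one goroutine per configured worker, all draining the same
	// queue. wait.Until restarts the worker one second after it returns, which
	// is the kind of fixed, non-configurable 1s period mentioned below.
	ttlWorkers := 4 // stand-in for the --workflow-ttl-workers value
	for i := 0; i < ttlWorkers; i++ {
		go wait.Until(func() {
			for processNextItem(queue) {
			}
		}, time.Second, ctx.Done())
	}

	// Enqueue a few fake workflow keys so the workers have something to drain.
	for i := 0; i < 10; i++ {
		queue.Add(fmt.Sprintf("argo/expired-wf-%d", i))
	}

	<-ctx.Done()
	queue.ShutDown()
}
```

If the workers themselves scale like this, a busy_count that never exceeds one would point at the queue simply not being filled fast enough, rather than at the worker pool.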

Strangely, the metric for the depth of the workflows queue is always at zero:

I was gonna say to check the workflow_ttl_queue metric, huh 🤔 😕

Does the queue depth of the other queues make sense? i.e. cron_wf_queue, pod_cleanup_queue, workflow_queue -- is workflow_ttl_queue the only odd one out that's not correct?

You can see that the number of busy TTL workers never surpasses one, and often no worker is busy at all. We expect this metric (argo_workflows_workers_busy_count) to approach the number set by --workflow-ttl-workers when there are many workflows to be cleaned up:

Yea I would expect this too... The logic looks correct to me too and is near identical to other parts of the codebase 🤔 😕

Same as above, does the busy_count look correct for the other queues? Or is it just workflow_ttl_queue that seems off?

It's worth noting that the controller only uses a fraction of the machine's CPU (approximately 5% of a 16-core machine).

That's a good note since fully utilized CPU would certainly limit concurrency. I'm assuming you have your k8s resource requests and limits set appropriately for the Controller Deployment too?

WORKFLOW_GC_PERIOD in the controller from the default value of 5m to 2m, to increase the periodicity for GC of workflows. We did not see any improvements so far though.

This appears to be a slight misnomer; in the code this refers to the periodicity specifically of the GC for node status offloads.

The period for GC for Workflow TTL / retention is located in the same for loop as above: this line sets it to every second and it is not configurable.

I read through some other issues, and it seems to be related to this (or at least the comment under this issue): #4634 (comment)

It looks like the user there did not necessarily confirm the fix from #4736, which added a small 1s delay to GC as well as the configurable number of TTL workers.

agilgur5 commented 10 months ago

But we see a high frequency of the logs time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL":

So this graph does look a little suspicious... this seems to log / run every 15-20 minutes? Any chance you can confirm the exact time period?

20 minutes corresponds to the Workflow Informer's resync period, which is when Informers rebuild their cache. So that would suggest there may be a bug with the Informer, and not the TTL queue or TTL GC necessarily... 🤔
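
For context, a minimal client-go sketch of what a 20-minute resync does (using Pods as a stand-in for Workflow objects; names and the kubeconfig path are assumptions): the informer periodically replays every object in its local cache through the update handler even when nothing changed on the server, so a controller that enqueues TTL deletions from those events would emit them in ~20-minute batches if no other events arrive.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; a real controller would use in-cluster config.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// 20-minute resync: the informer re-delivers every cached object to
	// UpdateFunc on this interval, even with no changes on the server.
	factory := informers.NewSharedInformerFactory(client, 20*time.Minute)
	informer := factory.Core().V1().Pods().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			key, err := cache.MetaNamespaceKeyFunc(newObj)
			if err != nil {
				return
			}
			// A TTL-style controller would decide here whether to enqueue the
			// object for deletion; on a resync this fires for the whole cache
			// at once, producing batched "Queueing ... for delete" style logs.
			fmt.Println("would enqueue", key)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep running to observe the periodic resync
}
```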

This may be the same root cause as #11948 and therefore fixed by #12133.

Can you try the tag published there, dev-fix-informer-3.4.13?

n1klasD commented 10 months ago

Thanks for your extensive reply!

Whether it utilizes it effectively, can't quite say on that one. Sounds like it may be a no based on your report.

Either it does not utilize it effectively, or the metric for that reports incorrect values.

Does the queue depth of the other queues make sense? i.e. cron_wf_queue, pod_cleanup_queue, workflow_queue -- is workflow_ttl_queue the only odd one out that's not correct?

When I look at the argo_workflows_queue_depth_count metric, there are reasonable spikes in the cron_wf_queue and workflow_queue series, which looks good. Apart from workflow_ttl_queue always being at zero, pod_cleanup_queue is also always at zero. There are queue adds greater than zero (argo_workflows_queue_adds_count) for all four queues, which points to an error in the reported queue depth for workflow_ttl_queue and pod_cleanup_queue.

Same as above, does the busy_count look correct for the other queues? Or is it just workflow_ttl_queue that seems off?

When looking at the argo_workflows_workers_busy_count metric, I only see this problem with the number of busy workers for the workflow_ttl_queue, where the value never spikes higher than one worker (https://user-images.githubusercontent.com/46707060/283146674-e9d5e274-620d-4ead-907a-36f66bb4c682.png). For workflow_queue, this correctly spikes to our configured number of 64 workers under high load. I also don't see this problem for pod_cleanup_queue. We do not set --cron-workflow-workers, but the number of busy cron workflow workers consistently spikes to 8; I assume that is the default.
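
For reference, one quick way to spot-check these series side by side is to scrape the controller's metrics endpoint directly and filter for the worker and queue metrics. A small sketch follows; the localhost address and port are assumptions (e.g. reachable via a port-forward to the workflow-controller) and should be adjusted to your metricsConfig.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumed endpoint; adjust host, port, and path to your controller's
	// metrics configuration (e.g. after a kubectl port-forward).
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Print only the worker/queue series discussed in this issue.
		if strings.Contains(line, "workers_busy") ||
			strings.Contains(line, "queue_depth") ||
			strings.Contains(line, "queue_adds") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```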

I'm assuming you have your k8s resource requests and limits set appropriately for the Controller Deployment too?

Yes, we have set appropriate requests and limits for the controller deployment. These are actually a lot higher than required, both in terms of memory and number of CPU cores.

n1klasD commented 10 months ago

This appears to be a slight misnomer; in the code this refers to the periodicity specifically of the GC for node status offloads.

Thanks for the clarification. In the docs for the environment variables, this is described as "The periodicity for GC of workflows.", hence the confusion.

n1klasD commented 10 months ago

But we see a high frequency of the logs time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL":

So this graph does look a little suspicious... this seems to log / run every 15-20 minutes? Any chance you can confirm the exact time period?

Yes, this is very consistently logged in batches every 18-20 minutes. That seems to align with the Workflow Informer's resync period of 20 minutes.

sgutwein commented 10 months ago

The periodic frequency of the "Queueing Succeeded workflow ..." logs can be seen here:

[screenshot: periodic batches of "Queueing Succeeded workflow ..." logs]

n1klasD commented 10 months ago

Can you try the tag published there, dev-fix-informer-3.4.13?

We currently only have this problem in our production environment because of the increased load, so we are naturally very hesitant to test this there. I'll get back to you on that, though. Do you have any idea when this fix will be released?

agilgur5 commented 10 months ago

Do you have any idea when this fix will be released?

The informer fix has been released as part of 3.4.14 now (and 3.5.2).

Are you able to try the new version and see if it is fixed?

We currently only have this problem in our production environment because of the increased load

So you could check in a staging environment whether the periodic frequency of the "Queueing Succeeded workflow [...] for delete [...] due to TTL" logs changes from ~20min to something more frequent. If that's the root cause (which sounds pretty likely right now), then a change in the frequency in staging would indicate it would fix the production issue as well.

n1klasD commented 10 months ago

I tested 3.4.14 in a staging environment and the frequency of the "Queueing Succeeded workflow [...] for delete [...] due to TTL" logs definitely increased. I no longer see the periodic batching; that's great!

However, the workflow queue depth metric argo_workflows_queue_depth_count still seems to be incorrect and shows the same behaviour as described above. I know that this was not part of this fix, though!

tooptoop4 commented 7 months ago

https://github.com/argoproj/argo-workflows/issues/12659 means that twice as many pods could be going into the cleanup queue as they should.