JPZ13 opened this issue 1 year ago
During our sync today, we discussed adding batch methods based on labels for deleting large numbers of Kubernetes resources as a potential solution. Additionally, we concluded that this might be caused by the pod informer getting backed up, in addition to or instead of the queue.
Assigning to @isubasinghe. We spoke about `DeleteCollection` during our sync. Combined with a label selector for the list options, we think this is a good way to avoid rate limiting.
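For context, a `DeleteCollection` call scoped by a label selector removes many pods in a single API request, which is what makes it attractive under rate limiting. A minimal sketch using client-go follows; the namespace handling and the completed-pod label are illustrative assumptions, not the controller's actual code.

```go
// Sketch only: delete a batch of pods in one API call with client-go's
// DeleteCollection, scoped by a label selector so that only pods already
// marked as completed are removed. The label key/value and namespace are
// illustrative assumptions.
package podcleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func deleteCompletedPods(ctx context.Context, namespace string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	// One DeleteCollection request replaces N individual Delete calls,
	// so the number of API requests no longer scales with pod count.
	return clientset.CoreV1().Pods(namespace).DeleteCollection(
		ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "workflows.argoproj.io/completed=true"},
	)
}
```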
Okay, so having had a look at this, I'm not convinced this has a (nice || easy) solution. There is no way to manually delete a collection of pods without accidentally deleting ones yet to be completed, unless we delete them after we label them as completed. For the labelling to happen, we need to add them to the work queues anyway. It is plausible that this may solve the issue for some workflows, but we would just be kicking the can down the road. The reason it may help is that we would roughly halve the number of items added to the queue, but obviously if we double the number of pods generated in a workflow, we will rediscover this problem.
There are some solutions here. We could introduce a `deleteStrategy` field in the YAML that indicates to skip all the queuing and do a big batch deletion whenever the workflow itself is completed.
The other solution is to get rid of the work queues altogether; I am not really sure why they are needed in the first place. The workflow controller is just a single process as far as I am aware (please verify this), and it both reads from and writes to the queue, which seems to be wasted compute on top of the rate-limiting problems. The queues would make sense if we ever expand to more than one controller, but when the same process does both the writing and the reading, I believe they are pure overhead.
EDIT: The work queues actually are controller-local, so there seems to be no benefit in having them at all; it is effectively just an in-memory map. We can label pods as completed and then delete them eventually when `persistUpdates` gets called. It doesn't seem as bad as the situation described above.
@JPZ13
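For readers unfamiliar with the work queues being discussed: the controller's pod cleanup queue is a client-go rate-limited workqueue, which is controller-local and in-memory. The sketch below shows how such a queue is typically constructed and drained; the limiter values mirror client-go's default controller rate limiter, and the function names are illustrative rather than the controller's actual code.

```go
// Minimal sketch of a controller-local, in-memory rate-limited work queue
// as provided by client-go. Limiter values mirror client-go's
// DefaultControllerRateLimiter and are illustrative only.
package podcleanup

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func newCleanupQueue() workqueue.RateLimitingInterface {
	// Per-item exponential backoff combined with an overall token bucket.
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
	return workqueue.NewNamedRateLimitingQueue(limiter, "pod_cleanup")
}

// runWorker is the drain loop a worker goroutine would run.
func runWorker(q workqueue.RateLimitingInterface, cleanup func(key string) error) {
	for {
		item, shutdown := q.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := cleanup(key); err != nil {
			q.AddRateLimited(key) // back off and retry later
		} else {
			q.Forget(key) // reset the per-item backoff
		}
		q.Done(item)
	}
}
```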
Thanks for diving in, @isubasinghe! Let's split this up into a short-term solution and a medium-term solution. In the short term, I think your idea of skipping the queueing and doing a big batch deletion whenever the workflow is completed is very viable. I'd suggest the following adjustment to the implementation, though, which should prevent us from needing a `deleteStrategy` field:

- Use the `podGC` field to specify an `OnWorkflowCompletion` option that allows the user to tell Argo to batch delete the pods and keep the workflow in memory (an illustrative sketch follows at the end of this comment)

We should look at potentially deprecating the work queues as a medium-term solution, since it's a broader refactor with a lot more surface area and much more potential for breakages.
What are your thoughts @isubasinghe? cc @mostaphaRoudsari @AntoineDao
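As an illustration of the suggestion above, here is a sketch of what setting the batch-cleanup option on a Workflow could look like via the Go API types. The package path, field, and constant names are taken from the argo-workflows v1alpha1 types as best understood here and should be treated as assumptions.

```go
// Illustrative sketch only: a Workflow whose pods are garbage-collected in
// one pass when the workflow completes, instead of queueing each pod
// individually. Type and constant names are assumed from the v1alpha1 API.
package podcleanup

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

func exampleWorkflow() wfv1.Workflow {
	return wfv1.Workflow{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "batch-cleanup-"},
		Spec: wfv1.WorkflowSpec{
			Entrypoint: "main",
			// Batch delete all pods once the workflow has completed,
			// while keeping the workflow object itself around.
			PodGC: &wfv1.PodGC{Strategy: wfv1.PodGCOnWorkflowCompletion},
		},
	}
}
```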
Circling back: I spoke with @sarabala1979 during the issue triage sync, and we decided that the above-referenced implementation of batch deleting pods on workflow deletion and adding the `OnWorkflowCompletion` option to `podGC` is a good solution. @isubasinghe, can you go ahead and begin work on it?
@JPZ13 Yeah that sounds like a good solution to me. I'm working on it now.
I've looked at the PR at @sarabala1979's request. I would like to know: could we use the `DeleteCollection` API?
Hi @alexec - the PR does use delete collection on this line: https://github.com/argoproj/argo-workflows/pull/10379/files#diff-e06997e9396a4f740b36a747e168e49b12a757c9cbbce910d20903fde8b7beceR126
Could we move discussion of this PR to the PR itself? I'm also going to close all of the draft PRs to make sure we have a single source of truth for discussion on this going forward
https://github.com/argoproj/argo-workflows/issues/12659 means that twice as many pods could be going into the cleanup queue as they should
Summary
We're working with a customer who is running GKE and Argo Workflows for data processing workloads. Occasionally, the CPU and memory usage on the workflow controller grows exponentially, causing the workflow controller to crash. We think this is caused by pod cleanup being rate limited by the GKE API and the internal queue backing up, but we're not sure, because there isn't clear logging telling us that this is happening.
Note: We spoke with @sarabala1979 during the last contributors' meeting about a similar issue. I'm creating this issue to track our work around getting better logging for rate limiting.
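One possible way to get the logging asked for here, sketched below under assumptions: wrap the cleanup queue's rate limiter and log whenever the delay it computes crosses a threshold. The wrapper type, threshold, and log wording are hypothetical, not existing controller code.

```go
// Sketch of one way to surface rate limiting: wrap the queue's RateLimiter
// and log whenever the delay it hands back crosses a threshold. The wrapper
// type, threshold, and log message are assumptions for illustration.
package podcleanup

import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
)

type loggingRateLimiter struct {
	workqueue.RateLimiter
	warnAfter time.Duration
}

// When reports the delay the wrapped limiter wants and logs it if it is long.
func (l loggingRateLimiter) When(item interface{}) time.Duration {
	delay := l.RateLimiter.When(item)
	if delay >= l.warnAfter {
		log.Printf("pod cleanup rate limited: item=%v delay=%s requeues=%d",
			item, delay, l.RateLimiter.NumRequeues(item))
	}
	return delay
}

func newLoggingCleanupQueue() workqueue.RateLimitingInterface {
	limiter := loggingRateLimiter{
		RateLimiter: workqueue.DefaultControllerRateLimiter(),
		warnAfter:   time.Second,
	}
	return workqueue.NewNamedRateLimitingQueue(limiter, "pod_cleanup")
}
```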
Use Cases
This would be useful for users whose K8s distro has some sort of API rate limiting, giving them better insight into what's happening during the CPU/memory buildup
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.