JPZ13 opened this issue 1 year ago
During our sync today, we discussed adding batch methods based on labels for deleting large numbers of Kubernetes resources as a potential solution. Additionally, we concluded that this might be caused by the pod informer getting backed up, in addition to or instead of the queue.
Assigning to @isubasinghe. We spoke about `DeleteCollection` during our sync. Combined with a label selector for the list options, we think this is a good way to avoid rate limiting.
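For context, a `DeleteCollection` call scoped by a label selector removes many pods in a single API request, which is what makes it attractive under rate limiting. A minimal sketch using client-go follows; the namespace handling and the completed-pod label are illustrative assumptions, not the controller's actual code.

```go
// Sketch only: delete a batch of pods in one API call with client-go's
// DeleteCollection, scoped by a label selector so that only pods already
// marked as completed are removed. The label key/value and namespace are
// illustrative assumptions.
package podcleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func deleteCompletedPods(ctx context.Context, namespace string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	// One DeleteCollection request replaces N individual Delete calls,
	// so the number of API requests no longer scales with pod count.
	return clientset.CoreV1().Pods(namespace).DeleteCollection(
		ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "workflows.argoproj.io/completed=true"},
	)
}
```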
Okay, so having had a look at this, I'm not convinced this has a (nice || easy) solution. There is no way to manually delete a collection of pods without accidentally deleting ones yet to be completed, unless we delete them after we label them as completed. For the labelling to happen, we need to add them to the work queues anyway. It is plausible that this may solve the issue for some workflows, but we would just be kicking the can down the road. The reason it may help is that we would roughly halve the number of items added to the queue, but obviously if we double the number of pods generated in a workflow, we will rediscover this problem.
There are some solutions here. We could introduce a `deleteStrategy` field in the YAML that indicates to skip all the queuing and do a big batch deletion whenever the workflow itself is completed.
The other solution is to get rid of the work queues altogether; I am not really sure why they are needed in the first place. The workflow controller is just a single process as far as I am aware (please verify this), and it both reads from and writes to the queue, which seems to be wasted compute on top of the rate-limiting problems. The queues would make sense if we ever expand to more than one controller, but when the same process does both the writing and the reading, I believe they are pure overhead.
EDIT: The work queues actually are controller-local, so there seems to be no benefit in having them at all; it is effectively just an in-memory map. We can label pods as completed and then delete them eventually when `persistUpdates` gets called. It doesn't seem as bad as the situation described above.
@JPZ13
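For readers unfamiliar with the work queues being discussed: the controller's pod cleanup queue is a client-go rate-limited workqueue, which is controller-local and in-memory. The sketch below shows how such a queue is typically constructed and drained; the limiter values mirror client-go's default controller rate limiter, and the function names are illustrative rather than the controller's actual code.

```go
// Minimal sketch of a controller-local, in-memory rate-limited work queue
// as provided by client-go. Limiter values mirror client-go's
// DefaultControllerRateLimiter and are illustrative only.
package podcleanup

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func newCleanupQueue() workqueue.RateLimitingInterface {
	// Per-item exponential backoff combined with an overall token bucket.
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
	return workqueue.NewNamedRateLimitingQueue(limiter, "pod_cleanup")
}

// runWorker is the drain loop a worker goroutine would run.
func runWorker(q workqueue.RateLimitingInterface, cleanup func(key string) error) {
	for {
		item, shutdown := q.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := cleanup(key); err != nil {
			q.AddRateLimited(key) // back off and retry later
		} else {
			q.Forget(key) // reset the per-item backoff
		}
		q.Done(item)
	}
}
```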
Thanks for diving in, @isubasinghe! Let's split this up into a short-term solution and a medium-term solution. In the short term, I think your idea of skipping the queueing and doing a big batch deletion whenever the workflow is completed is very viable. I'd suggest the following adjustment to the implementation, though, which should prevent us from needing a `deleteStrategy` field:

- Use the `podGC` field to specify an `OnWorkflowCompletion` option that allows the user to tell Argo to batch delete the pods and keep the workflow in memory (an illustrative sketch follows at the end of this comment)

We should look at potentially deprecating the work queues as a medium-term solution, since it's a broader refactor with a lot more surface area and much more potential for breakages.
What are your thoughts @isubasinghe? cc @mostaphaRoudsari @AntoineDao
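As an illustration of the suggestion above, here is a sketch of what setting the batch-cleanup option on a Workflow could look like via the Go API types. The package path, field, and constant names are taken from the argo-workflows v1alpha1 types as best understood here and should be treated as assumptions.

```go
// Illustrative sketch only: a Workflow whose pods are garbage-collected in
// one pass when the workflow completes, instead of queueing each pod
// individually. Type and constant names are assumed from the v1alpha1 API.
package podcleanup

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

func exampleWorkflow() wfv1.Workflow {
	return wfv1.Workflow{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "batch-cleanup-"},
		Spec: wfv1.WorkflowSpec{
			Entrypoint: "main",
			// Batch delete all pods once the workflow has completed,
			// while keeping the workflow object itself around.
			PodGC: &wfv1.PodGC{Strategy: wfv1.PodGCOnWorkflowCompletion},
		},
	}
}
```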
Circling back: I spoke with @sarabala1979 during the issue triage sync, and we decided that the above-referenced implementation of batch deleting pods on workflow deletion and adding the `OnWorkflowCompletion` option to `podGC` is a good solution. @isubasinghe, can you go ahead and begin work on it?
@JPZ13 Yeah that sounds like a good solution to me. I'm working on it now.
I've looked at the PR at @sarabala1979's request. I would like to know: could we use the `DeleteCollection` API?
Hi @alexec - the PR does use delete collection on this line: https://github.com/argoproj/argo-workflows/pull/10379/files#diff-e06997e9396a4f740b36a747e168e49b12a757c9cbbce910d20903fde8b7beceR126
Could we move discussion of this PR to the PR itself? I'm also going to close all of the draft PRs to make sure we have a single source of truth for discussion on this going forward
https://github.com/argoproj/argo-workflows/issues/12659 means that twice as many pods could be going into the cleanup queue as they should
Summary
We're working with a customer who is running GKE and Argo Workflows for data processing workloads. Occasionally, the CPU and memory usage on the workflow controller grows exponentially, causing the workflow controller to crash. We think this is caused by pod cleanup being rate limited by the GKE API and the internal queue backing up, but we're not sure, because there isn't clear logging telling us that this is happening.
Note: We spoke with @sarabala1979 during the last contributors' meeting about a similar issue. I'm creating this issue to track our work around getting better logging for rate limiting.
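One possible way to get the logging asked for here, sketched below under assumptions: wrap the cleanup queue's rate limiter and log whenever the delay it computes crosses a threshold. The wrapper type, threshold, and log wording are hypothetical, not existing controller code.

```go
// Sketch of one way to surface rate limiting: wrap the queue's RateLimiter
// and log whenever the delay it hands back crosses a threshold. The wrapper
// type, threshold, and log message are assumptions for illustration.
package podcleanup

import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
)

type loggingRateLimiter struct {
	workqueue.RateLimiter
	warnAfter time.Duration
}

// When reports the delay the wrapped limiter wants and logs it if it is long.
func (l loggingRateLimiter) When(item interface{}) time.Duration {
	delay := l.RateLimiter.When(item)
	if delay >= l.warnAfter {
		log.Printf("pod cleanup rate limited: item=%v delay=%s requeues=%d",
			item, delay, l.RateLimiter.NumRequeues(item))
	}
	return delay
}

func newLoggingCleanupQueue() workqueue.RateLimitingInterface {
	limiter := loggingRateLimiter{
		RateLimiter: workqueue.DefaultControllerRateLimiter(),
		warnAfter:   time.Second,
	}
	return workqueue.NewNamedRateLimitingQueue(limiter, "pod_cleanup")
}
```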
Use Cases
This would be useful for users whose K8s distro has some sort of API rate limiting, giving them better insight into what's happening during the CPU/memory buildup
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.