argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Environment variables to configure (shorten) Informer ResyncPeriods #13690

Open tooptoop4 opened 3 weeks ago

tooptoop4 commented 3 weeks ago

https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/controller.go#L165 is 20 minutes
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/controller.go#L167 is 30 minutes
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/controller.go#L170 is 20 minutes
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/taskresult.go#L29 is 20 minutes

shortening might solve https://github.com/argoproj/argo-workflows/issues/13671 / https://github.com/argoproj/argo-workflows/issues/10947 (which is linked to a k8s client bug) / https://github.com/argoproj/argo-workflows/issues/12352
https://github.com/argoproj/argo-workflows/issues/1038#issuecomment-485037426
https://github.com/argoproj/argo-workflows/issues/1416#issuecomment-511523810
https://github.com/argoproj/argo-workflows/issues/568#issuecomment-350212931
https://github.com/argoproj/argo-workflows/issues/532#issue-278962758
https://github.com/argoproj/argo-workflows/issues/3952
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/taskresult.go#L91-L93
https://github.com/argoproj/argo-workflows/pull/4423

agilgur5 commented 3 weeks ago

> shortening might solve #13671 / #10947 (which is linked to a k8s client bug)

That would be a workaround, not a solution. Cache rebuilds are expensive, especially if you have a large number of Workflows. We leave it at the k8s default, so if it's not tuned in Argo, making it user-configurable is a bit confusing, to say the least.

There's also one of these for every informer

Also please fill out the issue templates in full, especially if you want to be a good role model to others.

tooptoop4 commented 3 weeks ago

@agilgur5 can you clarify in what terms it is expensive? (k8s API calls, controller CPU/memory, something else?) That might be preferable to missing SLAs for me.

From reading https://github.com/kubernetes/kubernetes/issues/127964 and https://github.com/kubernetes/client-go/issues/571, the informer seems unreliable compared to listing the current state.

So the choice seems to be between relying on events and the cache to decide which workflows should be operated on (non-zero chance of missing some) and simply listing all workflows (guaranteed to have all of them).
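To make that trade-off concrete, here is a deliberately simplified Go model. It does not use client-go; the `event` type and the `applyEvents` and `missing` helpers are invented for illustration. It shows a cache built only from delivered events next to the result of a full list, assuming one event was lost:

```go
package main

import (
	"fmt"
	"sort"
)

// event is a simplified stand-in for a watch notification about one Workflow.
type event struct {
	name    string
	deleted bool
}

// applyEvents updates the cache using only the events that were delivered.
func applyEvents(cache map[string]bool, events []event) {
	for _, e := range events {
		if e.deleted {
			delete(cache, e.name)
		} else {
			cache[e.name] = true
		}
	}
}

// missing returns, in sorted order, the names present in the full list
// but absent from the event-driven cache.
func missing(cache map[string]bool, listed []string) []string {
	var out []string
	for _, n := range listed {
		if !cache[n] {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// What a full list call would return (assumed data).
	listed := []string{"wf-a", "wf-b", "wf-c"}
	// Events actually delivered; the one for wf-b is assumed to have been lost.
	delivered := []event{{name: "wf-a"}, {name: "wf-c"}}

	cache := map[string]bool{}
	applyEvents(cache, delivered)
	fmt.Println("missing from cache:", missing(cache, listed))
}
```

In this model the lost Workflow stays invisible until something compares the cache against a full list, which is the repair step a shorter resync or relist period would run more often, at the cost of doing the full list more often.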

agilgur5 commented 3 weeks ago

All of the above. It can do a full relist, which is expensive in k8s API calls and network I/O, and it iterates through the entire cache, which uses CPU and memory. Depending on your usage, you might be able to see the rebuild as a clear spike in your metrics, as in https://github.com/argoproj/argo-workflows/issues/12206#issuecomment-1812635873

In https://github.com/argoproj/argo-workflows/issues/12125#issuecomment-1791396968 (I forgot that issue existed, very similar) and https://github.com/argoproj/argo-workflows/pull/13466#issuecomment-2294907547 I linked to some readings upstream in https://github.com/kubernetes-client/java/issues/725#issuecomment-540352746, this k8s SIG API Machinery Google Group thread, and https://github.com/argoproj/gitops-engine/pull/617#discussion_r1698660165. According to those, Informers are supposed to be quite stable now and no longer relist, although it's unclear whether that applies outside of "core controllers". But core controllers, kubebuilder, controller-runtime, etc. all make heavy use of Informers, so they're an essential piece of k8s controllers upstream, and not necessarily something Argo should be working around if there are bugs.

I would say that whether it even makes sense to expose this to users is more of an upstream question, since it seems like k8s maintainers don't recommend changing the default for other tooling either.

> that might be preferable than missing SLAs for me

That's a bit of a different question, and one that is potentially worth exposing in its own right. The argument against it would be that if Informers are acting up, your entire cluster is going to be having problems, not just Argo.

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

tooptoop4 commented 2 days ago

/unrotten

agilgur5 commented 2 days ago

/unrotten

This is still missing information...