Open tooptoop4 opened 1 month ago
shortening might solve #13671 / #10947 (which is linked to a k8s client bug)
That would be a workaround, not a solution. Cache rebuilds are expensive, especially if you have a large number of Workflows. We leave it at the k8s default, so since it's not tuned in Argo, making it user-configurable is a bit confusing, to say the least.
There's also one of these for every informer
Also please fill out the issue templates in full, especially if you want to be a good role model to others.
@agilgur5 can you clarify "expensive" in what terms? (k8s API calls, controller CPU/memory, something else?) For me, that cost might be preferable to missing SLAs.
From reading https://github.com/kubernetes/kubernetes/issues/127964 and https://github.com/kubernetes/client-go/issues/571, informers seem unreliable compared to listing the current state.
So the choice seems to be: rely on events/cache to decide which workflows should be operated on (non-zero chance of missing some) vs. simply listing all workflows (guaranteed to have all of them).
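To make the trade-off in this comment concrete, here is a minimal toy sketch in Go. It is not client-go and does not reflect how Argo's informers are implemented; `Event`, `Cache`, `Apply`, and `Relist` are hypothetical names invented for illustration. It only shows that a cache fed purely by events stays wrong if one event is lost, while a full list repairs it at the cost of reading every object.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a simplified, hypothetical watch notification for one workflow.
type Event struct {
	Name    string
	Deleted bool
}

// Cache is a toy stand-in for an informer's local store (not client-go).
type Cache struct {
	items map[string]struct{}
}

// NewCache returns an empty cache.
func NewCache() *Cache {
	return &Cache{items: map[string]struct{}{}}
}

// Apply updates the cache from a single event. If an event never
// arrives, the cache silently stays out of date.
func (c *Cache) Apply(e Event) {
	if e.Deleted {
		delete(c.items, e.Name)
		return
	}
	c.items[e.Name] = struct{}{}
}

// Relist discards the cache and rebuilds it from a full listing.
// It returns how many objects had to be read, which is the cost
// that grows with the number of workflows.
func (c *Cache) Relist(server []string) int {
	c.items = make(map[string]struct{}, len(server))
	for _, name := range server {
		c.items[name] = struct{}{}
	}
	return len(server)
}

// Names returns the cached names in sorted order.
func (c *Cache) Names() []string {
	names := make([]string, 0, len(c.items))
	for name := range c.items {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// The "server" holds three workflows, but the event for wf-b is lost.
	server := []string{"wf-a", "wf-b", "wf-c"}
	cache := NewCache()
	for _, e := range []Event{{Name: "wf-a"}, {Name: "wf-c"}} {
		cache.Apply(e)
	}
	fmt.Println("after events:", cache.Names())

	read := cache.Relist(server)
	fmt.Println("after relist:", cache.Names(), "objects read:", read)
}
```

In this model the relist cost is linear in the number of objects, so how often it runs is the knob being debated in this thread.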
All of the above. It can do a full relist, which is k8s API and network I/O expensive, and iterates through the entire cache, which uses CPU and memory. Depending on your usage, you might be able to see the rebuild as a clear spike in your metrics as with https://github.com/argoproj/argo-workflows/issues/12206#issuecomment-1812635873
In https://github.com/argoproj/argo-workflows/issues/12125#issuecomment-1791396968 (I forgot that issue existed, very similar) and https://github.com/argoproj/argo-workflows/pull/13466#issuecomment-2294907547 I linked to some upstream reading: https://github.com/kubernetes-client/java/issues/725#issuecomment-540352746, this k8s SIG API Machinery Google Group thread, and https://github.com/argoproj/gitops-engine/pull/617#discussion_r1698660165. According to those, Informers are supposed to be quite stable now and no longer relist, although it's unclear if that applies outside of "core controllers". But core controllers, kubebuilder, controller-runtime, etc. all make heavy use of Informers, so they're an essential piece of k8s controllers upstream, and not necessarily something Argo should be working around if there are bugs.
I would say that whether it even makes sense to expose this to users is more of an upstream question, since it seems like k8s maintainers don't recommend changing the default for other tooling either.
> that might be preferable than missing SLAs for me
that's a bit of a different question that is potentially worth exposing in its own right, although the argument against that would be that if Informers are acting up, your entire cluster is going to be having some problems, not just Argo
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
/unrotten
/unrotten
This is still missing information...
According to https://github.com/kubernetes/kubernetes/pull/128183#issuecomment-2440872602, this is not an upstream issue.
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/controller.go#L165 is 20 minutes
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/controller.go#L167 is 30 minutes
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/controller.go#L170 is 20 minutes
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/taskresult.go#L29 is 20 minutes
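For illustration, one way such a period could be made user-configurable is an environment variable with a fallback to the current default. This is only a sketch under assumptions: `WORKFLOW_INFORMER_RESYNC_PERIOD`, `parseResync`, and `defaultResync` are hypothetical names, not existing Argo settings or functions, and the 20 minute default is simply the value cited above.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultResync mirrors the 20 minute value cited in this thread.
const defaultResync = 20 * time.Minute

// parseResync converts a string such as "5m" into a duration.
// It falls back to def when the value is empty, malformed, or not positive.
func parseResync(raw string, def time.Duration) time.Duration {
	if raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

func main() {
	// Hypothetical variable name, used only for this sketch.
	raw := os.Getenv("WORKFLOW_INFORMER_RESYNC_PERIOD")
	period := parseResync(raw, defaultResync)
	fmt.Println("resync period:", period)
}
```

Falling back to the default on invalid input keeps a typo in the setting from disabling or drastically shortening the period by accident.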
shortening might solve https://github.com/argoproj/argo-workflows/issues/13671 / https://github.com/argoproj/argo-workflows/issues/10947 (which is linked to a k8s client bug) / https://github.com/argoproj/argo-workflows/issues/12352
https://github.com/argoproj/argo-workflows/issues/1038#issuecomment-485037426
https://github.com/argoproj/argo-workflows/issues/1416#issuecomment-511523810
https://github.com/argoproj/argo-workflows/issues/568#issuecomment-350212931
https://github.com/argoproj/argo-workflows/issues/532#issue-278962758
https://github.com/argoproj/argo-workflows/issues/3952
https://github.com/argoproj/argo-workflows/blob/54621cc60117cf68183be24322119d85a80bb650/workflow/controller/taskresult.go#L91-L93
https://github.com/argoproj/argo-workflows/pull/4423