Describe the bug

When maintaining a registered cluster's live state cache to track the state of K8s resources, the Application Controller uses a peculiar API call pattern that performs poorly at scale, especially when there are a lot of resources of a particular kind.
At the moment, we can see here that for every API resource kind, we create a separate goroutine that:

1. Loads the initial state by kicking off a LIST call directly to etcd (RV set to an empty string) with the page size equal to 500.
2. Starts a WATCH, but doesn't pass the timeout parameter to the WATCH request options; instead, the watch connection is bounded (to 10 min by default) by stopping the watcher here and nullifying the RV there as well.
3. The function passed as an argument to RetryUntilSucceed is retried again (since it failed explicitly after that 10 min timeout).
4. Since the RV was set to an empty string in the aforementioned deferred method, the state is reloaded by kicking off a LIST call directly to etcd once again with the page size equal to 500.
This approach has a couple of problems:

- Lists issued to etcd are much more heavy-weight than lists served from the kube-apiserver's watch cache. When using the watch cache, kube-apiserver simply sends the client a copy of all of the resources from the cache (which already contains deserialized data). Otherwise, kube-apiserver has to fetch this information from etcd directly (applying non-trivial load to it) and decode and deserialize it along the way.
- The default page size of 500 for K8s API calls imposes lots of paginated etcd list calls when there are lots of resources of a particular kind, multiplying the overload effect from the point above.
- If you have a ginormous amount of a particular resource kind, e.g. 150 thousand Pods, the LIST API call with a small page size takes ages and might continuously hit error 410 (Gone) after falling off the etcd compaction window, which defaults to 1 min.
To Reproduce
1. Follow steps 1-6 from the Getting Started guide to register any cluster in Argo CD in a default setup.
2. Observe the logs of kube-apiserver to see periodic (every 10 minutes) LISTs of all resources issued directly to etcd (no resourceVersion parameter in the URI of the logged API request) rather than served from the kube-apiserver's watch cache (resourceVersion=0 in the URI string).
Expected behavior
argocd-application-controller's live state cache properly implements the List & Watch pattern when tracking the state of cluster resources: it issues a LIST API call served from the watch cache (i.e. with resourceVersion=0) and follows it with WATCH requests only (with an increasing RV).
Version
Logs
Logs from kube-apiserver for Pods from my small dev cluster I used for debugging: