Describe the bug

When maintaining a registered cluster's live state cache to track the state of K8s resources, the Application Controller uses a peculiar API call pattern that performs poorly at scale, especially when there are a lot of resources of a particular kind.
At the moment, we can see here that for every API resource kind, we create a separate goroutine that:

1. Loads the initial state by kicking off a LIST call directly to etcd (RV set to an empty string) with the page size equal to 500.
2. Starts a WATCH, but doesn't pass the timeout parameter to the WATCH request options; instead, the watch connection is bounded (to 10 min by default) by stopping the watcher here and nullifying the RV there as well.
3. The function passed as an argument to RetryUntilSucceed is retried again (since it failed explicitly after that 10 min timeout).
4. Since the RV was set to an empty string in the aforementioned deferred method, the state is reloaded by kicking off a LIST call directly to etcd once again with the page size equal to 500.
This approach has a couple of problems:

- Lists issued to etcd are much more heavy-weight than lists served from the kube-apiserver's watch cache. When using the watch cache, kube-apiserver simply sends the client a copy of all of the resources from the cache (which already contains deserialized data). Otherwise, kube-apiserver has to fetch this information from etcd directly (applying non-trivial load to it) and decode and deserialize it along the way.
- The default page size of 500 for K8s API calls imposes lots of paginated etcd list calls when there are lots of resources of a particular kind, multiplying the overload effect from the point above.
- If you have a ginormous amount of a particular resource kind, e.g. 150 thousand Pods, the LIST API call with a small page size takes ages and might continuously hit error 410 (Gone) after falling off the etcd compaction window, which defaults to 1 min.
To Reproduce
1. Follow steps 1-6 from the Getting Started guide to register any cluster in Argo CD in a default setup.
2. Observe the logs of kube-apiserver to see periodic (every 10 minutes) LISTs of all resources issued directly to etcd (no resourceVersion parameter in the URI of the logged API request) rather than served from the kube-apiserver's watch cache (resourceVersion=0 in the URI string).
Expected behavior
argocd-application-controller's live state cache properly implements the List & Watch pattern when tracking the state of cluster resources: it issues a LIST API call served from the watch cache (i.e. with resourceVersion=0) and follows it with WATCH requests only (with an increasing RV).
Version
Logs
Logs from kube-apiserver for Pods from my small dev cluster I used for debugging: