argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
16.77k stars 5.08k forks

Application Controller's live state cache doesn't use watch cache when talking to Kubernetes #18838

Open tosi3k opened 1 week ago

tosi3k commented 1 week ago

Describe the bug

When maintaining a registered cluster's live state cache to track the state of K8s resources, the Application Controller uses a peculiar API call pattern that performs poorly at scale, especially when there are many resources of a particular kind.

At the moment, we can see here that for every API resource kind, we create a separate goroutine that:

This approach has a couple of problems:

To Reproduce

Follow steps 1-6 from the Getting Started guide to register any cluster in Argo CD with a default setup.

Observe the kube-apiserver logs: every 10 minutes, LISTs of all resources are served directly from etcd (the logged request URI has no resourceVersion parameter) rather than from the kube-apiserver's watch cache (which would show resourceVersion=0 in the URI).

Expected behavior

argocd-application-controller's live state cache properly implements the List&Watch pattern when tracking the state of cluster resources: it issues an initial LIST API call served from the watch cache (i.e. with resourceVersion=0) and follows it with WATCH requests only (with increasing resourceVersion).

Screenshots

Version

argocd: v2.10.9+c071af8
  BuildDate: 2024-04-30T16:39:16Z
  GitCommit: c071af808170bfc39cbdf6b9be4d0212dd66db0c
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.10.9+c071af8

Logs

Logs from kube-apiserver for Pods, taken from the small dev cluster I used for debugging:

INFO 2024-06-27T16:26:36.948644Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="52.720934ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="2a65317a-6c0d-4cb8-9550-04ca9cfd5d0a" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="52.044796ms" resp=200
INFO 2024-06-27T16:26:37.229913Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2174400&watch=true" latency="1.275025ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="16b9589b-636a-4ce4-9ffb-8693c6ab20d0" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="734.606µs" apf_execution_time="736.272µs" resp=0
INFO 2024-06-27T16:36:37.960481Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="64.554862ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="02353768-038e-4011-ba95-be3b75b7c768" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="63.743979ms" resp=200
INFO 2024-06-27T16:36:38.226185Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2181052&watch=true" latency="1.345608ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="9bc33fd4-7352-4b13-aa83-c8240da8948c" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="564.697µs" apf_execution_time="566.634µs" resp=0
INFO 2024-06-27T16:46:38.954864Z "HTTP" verb="LIST" URI="/api/v1/pods?limit=500" latency="58.856642ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="140aaf47-e307-4eaa-b446-098f43f6d292" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="58.256763ms" resp=200
INFO 2024-06-27T16:46:39.237466Z "HTTP" verb="WATCH" URI="/api/v1/pods?allowWatchBookmarks=true&resourceVersion=2187695&watch=true" latency="1.368998ms" userAgent="argocd-application-controller/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="f61a7719-25c7-4fa5-887c-aadd5cd23b02" srcIP="34.16.56.103:1634" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="661.437µs" apf_execution_time="663.141µs" resp=0

wojtek-t commented 1 week ago

/cc

tosi3k commented 1 day ago

I'll try crafting some solution this week.