David-VTUK / prometheus-rancher-exporter


Exporter can create performance problems at scale #33

Open moio opened 8 months ago

moio commented 8 months ago

I am looking at a user's setup with ~1.4k single-node clusters managed by Rancher, and I see prometheus-rancher-exporter generating considerable Kubernetes API load, especially to retrieve cluster and node information.

Here is an excerpt showing the 10 slowest API calls within an 8-minute window:

| RequestUri | Verb | UserAgent | ResponseStatus | Kubernetes API Time (seconds) |
|---|---|---|---|---|
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 44.838 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 42.098 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 41.834 |
| /apis/management.cattle.io/v3/clusterroletemplatebindings | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 40.35 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.722 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.708 |
| /apis/management.cattle.io/v3/nodes | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.382 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.239 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 37.637 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 37.626 |

All of these are due to prometheus-rancher-exporter (in fact, so are roughly the top ~250 slowest calls in the sample I observed).

Unfortunately I do not know enough about the exporter's internals to suggest any solutions yet.

Anddd7 commented 5 months ago

We ran into a similar situation when we configured a short refresh timer (even though our cluster is not that large yet).

What I found is this:

https://github.com/David-VTUK/prometheus-rancher-exporter/blob/a61958c1bf96df96cefd4211e4ccdcdfd7026140/collector/collector.go#L260-L280
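If I read it correctly, every timer tick re-runs full LIST calls against the Rancher CRDs. Roughly this shape, as a simplified sketch of the pattern (not the project's actual code; the dynamic-client wiring here is just my illustration):

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// The Rancher management cluster resource that shows up in the slow-call list above.
var clustersGVR = schema.GroupVersionResource{
	Group:    "management.cattle.io",
	Version:  "v3",
	Resource: "clusters",
}

// pollClusters re-LISTs every cluster object on each tick, so the API server
// pays the full cost of the list on every refresh interval.
func pollClusters(ctx context.Context, dyn dynamic.Interface, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Every tick is a fresh, uncached LIST against the API server.
			clusters, err := dyn.Resource(clustersGVR).List(ctx, metav1.ListOptions{})
			if err != nil {
				log.Printf("list clusters: %v", err)
				continue
			}
			log.Printf("refreshed %d clusters", len(clusters.Items))
		}
	}
}
```

With ~1.4k cluster objects, each of those LISTs is expensive, and a short interval multiplies the load.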

mattmattox commented 5 months ago

Could this be solved by listing once to get the current state, then using watch handlers that just keep a local cache up to date?
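
For example, something along these lines with client-go's dynamic shared informers (the GVR and in-cluster config wiring here are placeholders, not the exporter's actual code):

```go
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// The Rancher management cluster resource the exporter currently re-LISTs.
var clustersGVR = schema.GroupVersionResource{
	Group:    "management.cattle.io",
	Version:  "v3",
	Resource: "clusters",
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// One initial LIST per resource, then WATCH keeps the local cache in sync.
	factory := dynamicinformer.NewDynamicSharedInformerFactory(dyn, 10*time.Minute)
	clustersInformer := factory.ForResource(clustersGVR)

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, clustersInformer.Informer().HasSynced)

	// A Prometheus Collect() could now read from the in-memory cache instead
	// of hitting the API server on every scrape or timer tick.
	clusters, err := clustersInformer.Lister().List(labels.Everything())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("cached clusters: %d", len(clusters))
}
```

That way the API server sees one LIST per resource at startup plus a long-lived WATCH, instead of a full re-list of ~1.4k cluster objects on every refresh.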