David-VTUK / prometheus-rancher-exporter


Exporter can create performance problems at scale #33

Open moio opened 8 months ago

moio commented 8 months ago

I am looking at a user's setup with ~1.4k single-node clusters managed by Rancher, and I see prometheus-rancher-exporter generating considerable Kubernetes API load, especially to retrieve cluster and node information.

Here is an excerpt showing the 10 slowest API calls within an 8-minute window:

| RequestUri | Verb | UserAgent | ResponseStatus | Kubernetes API Time (seconds) |
|---|---|---|---|---|
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 44.838 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 42.098 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 41.834 |
| /apis/management.cattle.io/v3/clusterroletemplatebindings | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 40.35 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.722 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.708 |
| /apis/management.cattle.io/v3/nodes | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.382 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 38.239 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 37.637 |
| /apis/management.cattle.io/v3/clusters | list | prometheus-rancher-exporter/v0.0.0 (linux/amd64) kubernetes/$Format | {"metadata":{},"code":200} | 37.626 |

All of these are due to prometheus-rancher-exporter (in fact, so are roughly the top ~250 slowest calls in the sample I observed).

Unfortunately I do not know enough about the exporter's internals to suggest any solutions yet.

Anddd7 commented 5 months ago

We ran into a similar situation when we configured a short refresh timer (even though our cluster is not that large yet).

What I found is this:

https://github.com/David-VTUK/prometheus-rancher-exporter/blob/a61958c1bf96df96cefd4211e4ccdcdfd7026140/collector/collector.go#L260-L280
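If I read it correctly, every timer tick re-runs full LIST calls against the Rancher CRDs. Roughly this shape, as a simplified sketch of the pattern (not the project's actual code; the dynamic-client wiring here is just my illustration):

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// The Rancher management cluster resource that shows up in the slow-call list above.
var clustersGVR = schema.GroupVersionResource{
	Group:    "management.cattle.io",
	Version:  "v3",
	Resource: "clusters",
}

// pollClusters re-LISTs every cluster object on each tick, so the API server
// pays the full cost of the list on every refresh interval.
func pollClusters(ctx context.Context, dyn dynamic.Interface, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Every tick is a fresh, uncached LIST against the API server.
			clusters, err := dyn.Resource(clustersGVR).List(ctx, metav1.ListOptions{})
			if err != nil {
				log.Printf("list clusters: %v", err)
				continue
			}
			log.Printf("refreshed %d clusters", len(clusters.Items))
		}
	}
}
```

With ~1.4k cluster objects, each of those LISTs is expensive, and a short interval multiplies the load.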

mattmattox commented 5 months ago

Could this be solved by listing once to get the current state, then using watch handlers that just keep a local cache up to date?
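
For example, something along these lines with client-go's dynamic shared informers (the GVR and in-cluster config wiring here are placeholders, not the exporter's actual code):

```go
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// The Rancher management cluster resource the exporter currently re-LISTs.
var clustersGVR = schema.GroupVersionResource{
	Group:    "management.cattle.io",
	Version:  "v3",
	Resource: "clusters",
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// One initial LIST per resource, then WATCH keeps the local cache in sync.
	factory := dynamicinformer.NewDynamicSharedInformerFactory(dyn, 10*time.Minute)
	clustersInformer := factory.ForResource(clustersGVR)

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, clustersInformer.Informer().HasSynced)

	// A Prometheus Collect() could now read from the in-memory cache instead
	// of hitting the API server on every scrape or timer tick.
	clusters, err := clustersInformer.Lister().List(labels.Everything())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("cached clusters: %d", len(clusters))
}
```

That way the API server sees one LIST per resource at startup plus a long-lived WATCH, instead of a full re-list of ~1.4k cluster objects on every refresh.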