fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

fluxd sync performance is degraded on Google Kubernetes Engine since the Kubernetes 1.20 upgrade #3573

Closed: sa-spag closed this issue 2 years ago

sa-spag commented 2 years ago

Describe the bug

As described on Slack, fluxd has been much slower to synchronise the target Kubernetes cluster with its Git repository since the cluster was upgraded from 1.19 to 1.20. According to my investigation, the stages that became much slower are those where fluxd relies on the API server to discover resource annotations and to perform garbage collection. Unfortunately, fluxd does not expose metrics for these stages, so only log timestamps enabled me to observe this behaviour.

Based on a couple of observations (others give similar results), fluxd took 72s for annotation discovery and 37s for garbage collection (109s in total) prior to the Kubernetes upgrade; it now takes 124s and 51s respectively (175s in total) for the same steps, with >1.3K manifests.

We are using Google Kubernetes Engine (GKE): most Kubernetes components, including the API server, are not directly managed by us. We opened a request with Google Cloud Platform Support but have not received a relevant response so far. Our current lead is that the number of API resources (i.e. `kubectl api-resources | wc -l`) has grown, which means fluxd has more requests to perform to discover the whole cluster; see the snippet below.
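
For instance (file names below are just illustrative):

```sh
# Count the API resources fluxd has to walk during discovery; run this
# before and after the upgrade and compare.
kubectl api-resources | wc -l

# Diffing the full lists shows exactly which groups appeared:
kubectl api-resources -o name | sort > api-resources-1.20.txt
diff api-resources-1.19.txt api-resources-1.20.txt
```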

Note that this may not be an issue with fluxd itself, but rather a flaw in GKE that amplifies fluxd's difficulty in scaling seamlessly.

Steps to reproduce

  1. Install fluxd on GKE 1.19
  2. Observe `sum(rate(flux_daemon_sync_duration_seconds_sum{}[1h])) by () / sum(rate(flux_daemon_sync_duration_seconds_count{}[1h])) by ()`, i.e. the average time it takes for fluxd to complete a synchronisation operation (written out in the block after this list)
  3. Upgrade to GKE 1.20
  4. Notice that the value observed in step 2 increases
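
For readability, here is the query from step 2 written out as a block (the `by ()` groupings aggregate away all labels and can simply be dropped):

```promql
sum(rate(flux_daemon_sync_duration_seconds_sum[1h]))
/
sum(rate(flux_daemon_sync_duration_seconds_count[1h]))
```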

Expected behavior

The Kubernetes upgrade on GKE has little to no impact on fluxd performance.

Kubernetes version / Distro / Cloud provider

Google Kubernetes Engine, Kubernetes 1.20

Flux version

fluxd 1.24.1, chart 1.11.1

Git provider

No response

Container Registry provider

No response

Additional context

No response

kingdonb commented 2 years ago

Thank you for reporting this issue. I take it from the thumbs-up reactions that others are validating this issue affects them as well. Very sorry for the trouble you're experiencing.

I'm not sure what causes the issue you are experiencing, but this is good information for triage. Hopefully we can narrow it down a bit and, if necessary and possible, a mitigation can be crafted that improves or restores your original performance.

sa-spag commented 2 years ago

We gained new insights into the issue, and it appears the number of resources in our cluster is the culprit: we ended up with over 30k VolumeSnapshots and VolumeSnapshotContents in a single cluster. I am not completely sure whether this is actually related to the Kubernetes upgrade, though Kubernetes Volume Snapshots were promoted to GA in 1.20. Since these resources are not managed by Flux, we safely excluded them from Flux's discovery using fluxd's `--k8s-unsafe-exclude-resource` flag, roughly as sketched below.
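
For illustration, the exclusion looks roughly like the following. The group/kind pattern syntax and fluxd's built-in default exclusions repeated here should be verified against your fluxd version's `--help` output, since setting the flag overrides the defaults:

```sh
# Appended to the daemon's existing args (e.g. via the Helm chart's
# additionalArgs value). The pattern syntax and the repeated defaults are
# assumptions to double-check against `fluxd --help`:
--k8s-unsafe-exclude-resource=*metrics.k8s.io/*,webhook.certmanager.k8s.io/*,v1/Event,snapshot.storage.k8s.io/*
```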

Flux is now much faster to complete a synchronization operation: we can close this issue. 🎉