Closed sa-spag closed 2 years ago
Thank you for reporting this issue. I take it from the thumbs-up reactions that others are confirming this issue affects them as well. Very sorry for the trouble you're experiencing.
I'm not sure what causes the issue you are experiencing, but this is good information for triage. Hopefully we can narrow it down a bit and, if necessary/possible, a mitigation can be crafted that will improve or restore your original performance.
We have new insights into the issue, and it appears the number of resources in our cluster is the culprit: we ended up having over 30k VolumeSnapshots and VolumeSnapshotContents on a single cluster. I am not completely sure whether this is actually related to the Kubernetes upgrade, though Kubernetes Volume Snapshots were promoted to GA. Since these resources are not managed by Flux, we safely excluded them from Flux's discovery using fluxd's `--k8s-unsafe-exclude-resource` flag.
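For reference, a sketch of what we did (the `<group>/*` wildcard pattern mirrors fluxd's default excludes, but treat the exact flag syntax as an assumption and check `fluxd --help` on your version):

```shell
# Gauge how many snapshot objects are inflating discovery (requires cluster access).
kubectl get volumesnapshots --all-namespaces --no-headers | wc -l
kubectl get volumesnapshotcontents --no-headers | wc -l

# Exclude the whole snapshot.storage.k8s.io API group from fluxd's discovery,
# alongside whatever flags you already pass to the daemon.
fluxd --k8s-unsafe-exclude-resource="snapshot.storage.k8s.io/*"
```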
Flux is now much faster to complete a synchronization operation: we can close this issue. 🎉
Describe the bug
As described on Slack, fluxd is much slower to synchronise the target Kubernetes cluster with its Git repository since the cluster was upgraded from 1.19 to 1.20. According to my investigation, the stages that became much slower are those in which fluxd relies on the API server to discover resource annotations and to perform garbage collection. Unfortunately, fluxd does not expose related metrics, so only log timestamps enabled me to observe this behaviour.
Based on a couple of observations (others give similar results), fluxd took 72s and 37s (109s in total) for annotation discovery and garbage collection respectively prior to the Kubernetes upgrade; it now takes 124s and 51s (175s in total) for the same steps, for >1.3k manifests.
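For clarity, the slowdown implied by those timings (pure arithmetic on the numbers above):

```shell
# Observed durations in seconds: annotation discovery + garbage collection.
before=$((72 + 37))    # Kubernetes 1.19
after=$((124 + 51))    # Kubernetes 1.20
echo "before=${before}s after=${after}s"
# Integer percentage increase.
echo "increase=$(( (after - before) * 100 / before ))%"
# Prints:
#   before=109s after=175s
#   increase=60%
```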
We are using Google Kubernetes Engine (GKE): most Kubernetes components, including the API server, are not directly managed by us. We opened a case with Google Cloud Platform Support but have not received relevant responses so far. Our current clue is that the number of API resources (i.e. `kubectl api-resources | wc -l`) has increased, which implies fluxd has more requests to perform to discover the whole cluster. Note that this may not be an issue with fluxd itself, but rather a flaw in GKE that amplifies fluxd's weakness in scaling seamlessly.
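To see where the growth comes from, the resource count can be broken down by API group (a diagnostic sketch; requires cluster access, and the `sed` expression assumes the `<resource>.<group>` shape that `-o name` prints):

```shell
# Total number of API resources the server advertises; each group/version
# adds discovery requests for fluxd to perform.
kubectl api-resources --no-headers | wc -l

# Fully-qualified resource names grouped by API group; "-o name" prints
# "<resource>.<group>" (core-group resources print without a group and
# therefore show up as a blank line in this breakdown).
kubectl api-resources --no-headers -o name \
  | sed 's/^[^.]*\.*//' \
  | sort | uniq -c | sort -rn
```

Comparing this breakdown before and after an upgrade should show which groups (e.g. snapshot.storage.k8s.io once it reached GA) are new.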
Steps to reproduce
Measure the following PromQL query before and after the upgrade, i.e. the average time it takes for fluxd to complete a synchronisation operation:
sum(rate(flux_daemon_sync_duration_seconds_sum{}[1h])) by ()/sum(rate(flux_daemon_sync_duration_seconds_count{}[1h])) by ()
Expected behavior
The Kubernetes upgrade on GKE has little to no impact on fluxd performance.
Kubernetes version / Distro / Cloud provider
Google Kubernetes Engine, Kubernetes 1.20
Flux version
fluxd 1.24.1, chart 1.11.1
Git provider
No response
Container Registry provider
No response
Additional context
No response
Maintenance Acknowledgement
Code of Conduct