Open 0xStarcat opened 1 month ago
The 2nd issue where a single `ImageUpdateAutomation` stopped running happened again today, and we had to cycle the controller pod to resolve it.

There is some strange connection between the 2 automations that stopped today and Tuesday. Both were automations in the same cluster that were updating the same image name / tag, but for 2 different apps. On Tuesday automation B stalled, while today automation A stalled.
After Tuesday's issue we decided to consolidate them down to 1 automation by widening the `update.path` to target any file in the cluster directory. After merging that MR to remove automation B, the last remaining automation (automation A) stopped 15 minutes later.
After more investigation, the `image-automation-controller` was getting CPU throttled at the default 1 CPU and had been logging a steady rise in this error: `unable to clone 'ssh://[GIT_REPO_REDACTED].git': context deadline exceeded`, so we've taken steps to increase its CPU (up to 3 at this point) and are monitoring.
This however doesn't fully explain both issues: the controller stopped reconciling a single `ImageUpdateAutomation` resource but continued with the others.

The controller has an exponential backoff mechanism, the same as Kubernetes has for image pull errors. Every time a reconcile fails, the backoff increases the retry interval and can go up to hours. Is this repo an app repo? Does it have a huge .git history?
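That backoff behavior can be sketched with a tiny helper. This is illustrative only: the real controller uses a controller-runtime rate limiter, and the `base` and `cap` values below are made-up assumptions, not Flux's actual settings.

```python
# Illustrative sketch of per-item exponential backoff (Kubernetes-style).
# The base and cap values are hypothetical, not Flux's real defaults.
def backoff_interval(failures: int, base: float = 1.0, cap: float = 3600.0) -> float:
    """Retry delay in seconds after `failures` consecutive reconcile errors."""
    return min(base * (2 ** failures), cap)

# The delay doubles with each consecutive failure (1s, 2s, 4s, ...) and
# saturates at the cap, which is why a persistently failing automation
# can end up retrying only once every few hours.
```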
It's a dedicated Flux repo, but there are a lot of configurations in there (4 main clusters, approx 30+ apps / image automations per cluster, each with intervals of `1m`).
```
clusters/
├── cluster1/
│   └── namespace1/
│       └── app1/
└── etc...
```
I think this command gives me the size of the history:

```
$ git count-objects -vH
count: 4239
size: 16.57 MiB
in-pack: 294837
packs: 41
size-pack: 39.72 MiB
prune-packable: 1666
garbage: 0
size-garbage: 0 bytes
```
It clones pretty quickly on my laptop, in just a couple of seconds.

Could networking be a potential cause? The primary errors we're seeing from the image-automation-controller (about 1-2 per minute) are:
```
unable to clone 'ssh://[REDACTED].git': context deadline exceeded
unable to clone 'ssh://[REDACTED].git': dial tcp [REDACTED]: connect: connection timed out
unable to clone 'ssh://[REDACTED].git': unexpected EOF
```
--
We're using GitLab, and it has an SSH Git operations limit of 600/min. Is the `image-automation-controller` performing `git pull` operations? Or is it just checking against the `source-controller`?
I see the `image-automation-controller` log `no changes made in working directory; no commit` at a rate of about 120/min, which is well under the GitLab rate limit anyway.
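That rate is easy to back-of-envelope from the approximate numbers given earlier in the thread (4 clusters, ~30 automations each, `1m` intervals); the arithmetic below is purely illustrative:

```python
# Rough estimate of total reconcile/clone attempts per minute against
# GitLab, using the approximate counts from this thread.
clusters = 4
automations_per_cluster = 30
interval_minutes = 1

clones_per_minute = clusters * automations_per_cluster / interval_minutes
gitlab_ssh_limit_per_minute = 600  # GitLab.com SSH git-operations limit

print(clones_per_minute)  # 120.0, matching the observed ~120/min log rate
print(clones_per_minute < gitlab_ssh_limit_per_minute)  # True
```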
The controller does a full clone every time; you may want to increase the timeout in the GitRepo.
I'll give it a shot and monitor over the next day or so. thanks!
So the errors are perhaps a red herring, and what I'm seeing doesn't seem associated with the exponential backoff.

I turned on `--log-level=debug` on the image-automation-controller and observed another automation getting stuck (it was the only one in the cluster whose "Last Run" was a few hours ago). The final logs emitted for the automation (filtered by name) were:
```
Jun 03 19:00:03.386 {"msg":"attempting to clone git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug",...}
Jun 03 19:00:03.386 {"msg":"fetching git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug"...}
Jun 03 18:58:42.303 {"msg":"no changes made in working directory; no commit","controller":"imageupdateautomation","level":"info"...}
Jun 03 18:58:42.303 {"msg":"ran updates to working dir","controller":"imageupdateautomation","level":"debug"...}
Jun 03 18:58:42.303 {"msg":"updating with setters according to image policies","manifests-path":"/tmp/flux-system-flux-system572111690/clusters/mvn-954bfd3b-qa2/cap/core-services","controller":"imageupdateautomation","level":"debug"...}
Jun 03 18:58:33.291 {"msg":"attempting to clone git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug"...}
Jun 03 18:58:33.291 {"msg":"fetching git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug"...}
Jun 03 18:57:12.221 {"msg":"no changes made in working directory;...
```
So it's hanging after the debug log `attempting to clone git repository`, while no other `ImageUpdateAutomation` resources are stuck. This appears to be happening to random automations, as it's a different one this time.

There was an "exponential backoff" log message plus a context deadline error about 15 minutes before this, but the later logs show the resource recovering and continuing to log "no changes made..." until it goes silent.
I've so far increased CPU on the image-automation-controller and increased the GitRepository timeout to `2m`; however, single ImageUpdateAutomations are still getting stuck for hours even when new tags are detected by image-reflector. Is there any more guidance / clues you can offer to help debug this?

If more information is helpful, we had one automation stall then resolve itself overnight. Unfortunately this cluster didn't have the debug logs turned on, but interestingly the logs after it un-stalled were the context deadline error (11 hours later):
```
Jun 05 06:12:48.213 {"msg":"Reconciler error","controller":"imageupdateautomation","level":"error","namespace":"mls"...,"error":"unable to clone 'ssh://[REDACTED].git': context deadline exceeded",...}
Jun 05 06:12:48.212 {"msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes requeuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","controller":"imageupdateautomation","level":"info","namespace":"mls",...}
Jun 04 19:12:43.175 {"msg":"no changes made in working directory; no commit","controller":"imageupdateautomation","level":"info","namespace":"mls",...}
```
That's strange, because the timeout for the clone operation is derived from the GitRepository's `.spec.timeout` field. What's the value of that field in your case?

Agree, it's well beyond the timeout. It was bumped up to `2m0s`:
```yaml
# kustomization.yaml
patches:
  - patch: |
      apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      metadata:
        name: flux-system
      spec:
        timeout: 2m0s
```
For the time being, we've deployed a pod that kills/cycles the `image-automation-controller` pod when it detects that an `ImageUpdateAutomation` hasn't been updated for 5 minutes (with the 2m timeout, they should be getting run at least every 2 minutes). Killing the controller pod un-stalls the automations reliably. It kills a few controller pods per day.
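A minimal sketch of that watchdog's core logic, assuming the automation's `status.lastAutomationRunTime` field is populated and the controller pod carries an `app=image-automation-controller` label; the kubectl invocations, label, and namespace here are illustrative, not our exact deployment:

```python
# Hypothetical watchdog: cycle the image-automation-controller pod when any
# ImageUpdateAutomation hasn't run for longer than a staleness threshold.
import json
import subprocess
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)

def is_stale(last_run_iso: str, now=None, threshold=STALE_AFTER) -> bool:
    """True if an RFC 3339 lastAutomationRunTime is older than the threshold."""
    if not last_run_iso:  # never ran at all: treat as stale
        return True
    now = now or datetime.now(timezone.utc)
    last = datetime.fromisoformat(last_run_iso.replace("Z", "+00:00"))
    return now - last > threshold

def check_and_cycle() -> None:
    """List all automations and delete the controller pod if any is stale."""
    out = subprocess.check_output(
        ["kubectl", "get", "imageupdateautomations", "-A", "-o", "json"])
    items = json.loads(out)["items"]
    if any(is_stale(i.get("status", {}).get("lastAutomationRunTime", ""))
           for i in items):
        # Deleting the pod lets the Deployment recreate it, which un-stalls
        # the automations, matching the behavior described above.
        subprocess.run(["kubectl", "-n", "flux-system", "delete", "pod",
                        "-l", "app=image-automation-controller"], check=False)
```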
I'm wondering if this has a relationship to this issue: https://github.com/fluxcd/source-controller/issues/1154
After digging some more, I found the `fluxcd.workqueue.unfinished_work` Datadog metric showing abnormal values right before the issues were noticeable. This metric is described as:

> The number of seconds of work that has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads.

We do have a fair amount of SSH errors in our logs cloning from GitLab.com, which to me seems pretty expected from a SaaS provider. Stuck threads, if they cause these controllers to stall, seem like a problem originating in FluxCD.
Are there any mitigations you'd recommend that might reduce the impact of this on v2.2?
Making a new issue after seeing this comment in the last issue on this topic: https://github.com/fluxcd/image-automation-controller/issues/282#issuecomment-1151022724

Specs

On Flux v2.2, with the image-update-automation pod running the image `image-automation-controller:v0.37.0` for `image.toolkit.fluxcd.io/v1beta1` automation resources.

Description
The issue we saw was that the image-automation-controller stopped reconciling all `ImageUpdateAutomation` resources in the cluster for ~5 hours; then, when it resumed, it stopped reconciling only 1 `ImageUpdateAutomation` resource out of many in the cluster.

To resolve it, we had to delete the image-automation-controller pod, and upon restart everything worked fine.

The following telemetry screenshots show CPU usage dropping off entirely, log volume dropping off entirely, and no errors produced in the logs. There were no pod restarts / crashloops or any k8s events for the pods.
I've attached a `debug/pprof/goroutine?debug=2` profile to this issue as well: heap.txt

Please let me know if there are any further details I can provide. Thank you!