fluxcd / image-automation-controller

GitOps Toolkit controller that patches container image tags in Git
https://fluxcd.io
Apache License 2.0

Flux 2.2, image-automation-controller stops reconciling; needed restart #689

Open 0xStarcat opened 1 month ago

0xStarcat commented 1 month ago

Making a new issue after seeing this comment in the last issue on this topic: https://github.com/fluxcd/image-automation-controller/issues/282#issuecomment-1151022724

Specs

On Flux v2.2, with the image-automation-controller pod running image image-automation-controller:v0.37.0 and reconciling image.toolkit.fluxcd.io/v1beta1 automation resources.

Description

The image-automation-controller stopped reconciling all ImageUpdateAutomation resources in the cluster for roughly 5 hours; when it resumed, it kept skipping a single ImageUpdateAutomation resource out of the many in the cluster.

To resolve it, we had to delete the image-automation-controller pod; after the restart everything worked fine.

The following telemetry screenshots show CPU usage dropping off entirely, log volume dropping off entirely, and no errors produced in the logs. There were no pod restarts, crash loops, or Kubernetes events for the pods.

[Telemetry screenshots attached: 2024-05-29 2:53 PM and 2:54 PM]

I've attached a debug/pprof/goroutine?debug=2 dump to this issue as well.

heap.txt

Please let me know if there's any further details I can provide. Thank you!

0xStarcat commented 1 month ago

The second issue, where a single ImageUpdateAutomation stopped running, happened again today, and we had to cycle the controller pod to resolve it.

There is some strange connection between the two automations that stopped today and on Tuesday. Both are in the same cluster and update the same image name/tag, but for two different apps. On Tuesday automation B stalled; today automation A stalled.

After Tuesday's issue we decided to consolidate them into a single automation by widening update.path to target any file in the cluster directory. After merging the MR that removed automation B, the last remaining automation (automation A) stopped 15 minutes later.

0xStarcat commented 1 month ago

After more investigation, the image-automation-controller was being CPU throttled at the default 1 CPU, and we had been seeing a steady rise in this error: unable to clone 'ssh://[GIT_REPO_REDACTED].git': context deadline exceeded. We've increased its CPU (to 3 at this point) and are monitoring.

This however doesn't fully explain both issues:

  1. what would cause the pod to "go silent"
  2. what would cause it to stop reconciling a single ImageUpdateAutomation resource while continuing with others
stefanprodan commented 1 month ago

The controller has an exponential backoff mechanism, the same as Kubernetes has for image pull errors. Every time a reconcile fails, the backoff increases the retry interval and can go up to hours. Is this repo an app repo? Does it have a huge .git history?
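
For illustration, here is a minimal sketch of how such per-item exponential backoff typically behaves, assuming the client-go workqueue rate limiter that controller-runtime based controllers commonly use; the base and max delays below are made-up values, not Flux's actual configuration:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential failure rate limiter: each consecutive failure
	// for the same item doubles the retry delay, capped at the max.
	// The 1s base and 4h cap are illustrative only.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 4*time.Hour)

	item := "flux-system/my-image-update-automation" // hypothetical object key
	for i := 0; i < 15; i++ {
		// When() is what the workqueue consults to decide how long to wait
		// before re-queuing a failed reconcile of this item.
		fmt.Printf("failure %2d -> retry in %v\n", i+1, limiter.When(item))
	}

	// A successful reconcile resets the backoff for that item.
	limiter.Forget(item)
	fmt.Println("after Forget:", limiter.When(item))
}

Running this prints delays of 1s, 2s, 4s, ... until the cap is hit, which is why a repeatedly failing object can end up being retried only every few hours.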

0xStarcat commented 1 month ago

It's a dedicated Flux repo, but there is a lot of configuration in there (4 main clusters, approximately 30+ apps / image automations per cluster, each with an interval of 1m).

clusters/
  |
  cluster1/
  |  |
  |  namespace1
  |    |
  |    app1
  |
  etc...

I think this command gives me the size of the history:

$ git count-objects -vH
count: 4239
size: 16.57 MiB
in-pack: 294837
packs: 41
size-pack: 39.72 MiB
prune-packable: 1666
garbage: 0
size-garbage: 0 bytes

It clones pretty quickly on my laptop, in just a couple of seconds.

0xStarcat commented 1 month ago

Could networking be a potential cause?

The primary errors we're seeing from the image-controller (about 1-2 per minute) are:

unable to clone 'ssh://[REDACTED].git': context deadline exceeded

unable to clone 'ssh://[REDACTED].git': dial tcp [REDACTED]: connect: connection timed out

unable to clone 'ssh://[REDACTED].git': unexpected EOF

--

We're using GitLab, which has an SSH Git operations limit of 600/min. Is the image-automation-controller performing Git pull operations itself, or is it just checking against the source-controller?

I see the image-automation-controller log "no changes made in working directory; no commit" at a rate of about 120/min, which is well under the GitLab rate limit anyway.

stefanprodan commented 1 month ago

The controller does a full clone every time; you may want to increase the timeout in the GitRepository.
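
Roughly, each reconcile does something like the sketch below: a full clone bounded by a context deadline, which is why a slow clone surfaces as "context deadline exceeded". This uses go-git purely for illustration; the controller's actual Git implementation and options may differ, and the URL and timeout here are placeholders standing in for the GitRepository .spec.url and .spec.timeout:

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	git "github.com/go-git/go-git/v5"
)

func main() {
	// The clone is bounded by a context deadline; exceeding it is reported
	// as "context deadline exceeded". 60s stands in for .spec.timeout.
	timeout := 60 * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	dir, err := os.MkdirTemp("", "clone-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// A full (non-shallow) clone on every reconcile means the cost grows
	// with the repository's pack size and object count.
	_, err = git.PlainCloneContext(ctx, dir, false, &git.CloneOptions{
		URL: "https://example.com/org/fleet-repo.git", // placeholder URL
	})
	if err != nil {
		fmt.Println("clone failed:", err)
	}
}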

0xStarcat commented 1 month ago

I'll give it a shot and monitor over the next day or so. Thanks!

0xStarcat commented 1 month ago

So the errors are perhaps a red herring and what I'm seeing doesn't seem associated with the exponential backoff.

I turned on --log-level=debug on the image-automation-controller and observed another automation getting stuck (it was the only one in the cluster where "Last Run" was a few hours ago).

The final logs emitted for the automation (filtered by name) were:

Jun 03 19:00:03.386
{"msg":"attempting to clone git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug",...}

Jun 03 19:00:03.386
{"msg":"fetching git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug"...}

Jun 03 18:58:42.303
{"msg":"no changes made in working directory; no commit","controller":"imageupdateautomation","level":"info"...}

Jun 03 18:58:42.303
{"msg":"ran updates to working dir","controller":"imageupdateautomation","level":"debug"...}

Jun 03 18:58:42.303
{"msg":"updating with setters according to image policies","manifests-path":"/tmp/flux-system-flux-system572111690/clusters/mvn-954bfd3b-qa2/cap/core-services","controller":"imageupdateautomation","level":"debug"...}

Jun 03 18:58:33.291
{"msg":"attempting to clone git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug"...}

Jun 03 18:58:33.291
{"msg":"fetching git repository","gitrepository":{"name":"flux-system","namespace":"flux-system"},"controller":"imageupdateautomation","level":"debug"...}

Jun 03 18:57:12.221
{"msg":"no changes made in working directory;...

So it's hanging after the "attempting to clone git repository" debug log, while no other ImageUpdateAutomation resources are stuck. This appears to happen to random automations, as it's a different one this time.

There was an "exponential backoff" log message and a context deadline error about 15 minutes before this, but the later logs show the resource recovering and continuing to log "no changes made..." until it went silent.

I've so far increased CPU on the image-automation-controller and increased the GitRepository timeout to 2m; however, single ImageUpdateAutomations are still getting stuck for hours even when new tags are detected by the image-reflector-controller. Is there any more guidance or any clues you can offer to help debug this?

0xStarcat commented 4 weeks ago

If more information is helpful, we had one automation stall and then resolve itself overnight.

Unfortunately this cluster didn't have debug logs turned on, but interestingly the first log after it un-stalled (11 hours later) was the context deadline error:


Jun 05 06:12:48.213
{"msg":"Reconciler error","controller":"imageupdateautomation","level":"error","namespace":"mls"...,"error":"unable to clone 'ssh://[REDACTED].git': context deadline exceeded",...}

Jun 05 06:12:48.212
{"msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","controller":"imageupdateautomation","level":"info","namespace":"mls",...}

Jun 04 19:12:43.175
{"msg":"no changes made in working directory; no commit","controller":"imageupdateautomation","level":"info","namespace":"mls",...}
[Screenshot attached: 2024-06-05 10:08 AM]

makkes commented 4 weeks ago

That's strange because the timeout for the clone operation is derived from the GitRepository's .spec.timeout field. What's the value of that field in your case?

0xStarcat commented 4 weeks ago

Agreed, it's well beyond the timeout. It was bumped up to 2m0s:

# kustomization.yaml
- patch: |
      apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      metadata:
        name: flux-system
      spec:
        timeout: 2m0s

0xStarcat commented 3 weeks ago

For the time being, we've deployed a pod that kills/cycles the image-automation-controller pod when it detects that an ImageUpdateAutomation hasn't been updated for 5 minutes (with the 2m timeout, they should be getting run at least every couple of minutes). Killing the controller pod un-stalls the automations reliably; it ends up cycling the controller a few times per day. A rough sketch of the watchdog is below.
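
For anyone wanting to replicate it, here is a rough sketch of that watchdog. It assumes the ImageUpdateAutomation status exposes a lastAutomationRunTime timestamp and that the controller pod carries an app=image-automation-controller label in flux-system; the field name, label, namespace, and staleness threshold are assumptions to adjust for your setup:

package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const staleAfter = 5 * time.Minute

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// ImageUpdateAutomation GVR; the version should match what the cluster serves.
	gvr := schema.GroupVersionResource{
		Group:    "image.toolkit.fluxcd.io",
		Version:  "v1beta1",
		Resource: "imageupdateautomations",
	}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		list, err := dyn.Resource(gvr).Namespace("").List(ctx, metav1.ListOptions{})
		if err != nil {
			log.Println("list failed:", err)
			cancel()
			time.Sleep(time.Minute)
			continue
		}

		stale := false
		for _, item := range list.Items {
			// Assumes the status carries a lastAutomationRunTime timestamp.
			ts, found, _ := unstructured.NestedString(item.Object, "status", "lastAutomationRunTime")
			if !found {
				continue
			}
			last, err := time.Parse(time.RFC3339, ts)
			if err == nil && time.Since(last) > staleAfter {
				log.Printf("%s/%s last ran %s ago", item.GetNamespace(), item.GetName(), time.Since(last))
				stale = true
			}
		}

		if stale {
			// Deleting the controller pod lets its Deployment recreate it,
			// which un-stalls the stuck automations reliably in our case.
			err := clientset.CoreV1().Pods("flux-system").DeleteCollection(ctx,
				metav1.DeleteOptions{},
				metav1.ListOptions{LabelSelector: "app=image-automation-controller"})
			if err != nil {
				log.Println("pod delete failed:", err)
			}
		}
		cancel()
		time.Sleep(time.Minute)
	}
}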

0xStarcat commented 2 weeks ago

I'm wondering if this has a relationship to this issue: https://github.com/fluxcd/source-controller/issues/1154

After digging some more, I found the fluxcd.workqueue.unfinished_work Datadog metric showing abnormal values right before the issues were noticeable:

[Screenshot attached: 2024-06-14 12:24 PM]

This Datadog metric is described as:

The number of seconds of work that has been done that is in progress and 
hasn't been observed by work_duration. Large values indicate stuck threads. 

We do have a fair number of SSH errors in our logs when cloning from GitLab.com, which seems expected from a SaaS provider. But stuck threads, if they are what cause these controllers to stall, seem like a problem originating in FluxCD.

Are there any mitigations you'd recommend that might reduce the impact of this on v2.2?