stefanprodan closed this issue 2 years ago.
This coincided with outages at AWS in us-west-2, where I'm running Kubernetes on EKS. My entire cluster went down for about 10 minutes that day, and when it came back up I ran into this issue.
Hi, while waiting for a permanent fix, is there any advice on how to detect the "stuck" image-automation-controller case? Are there any metrics or logs we should monitor? We have been suffering from this issue (roughly a couple of times per week) for some time, and the only alert comes from users who pushed their images to the container registry and didn't see a cluster deployment for quite some time.
Another observation is that in most cases restarting image-automation-controller is sufficient, but there were two times when we also needed to restart source-controller.
@bondido how about something like this:
kubectl get pod -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-96dc99bfd-p9g4q 1/1 Running 0 9d
image-automation-controller-64c955c59-ckbft 1/1 Running 0 9d <<< ------
image-reflector-controller-55fb7f765d-cr8zn 1/1 Running 0 9d
kustomize-controller-7bc878f768-xf2xb 1/1 Running 0 9d
notification-controller-945795558-q8nht 1/1 Running 0 9d
source-controller-65665fd68f-n7qqz 1/1 Running 0 9d
@jwerre I'm afraid it won't help, as a long-running controller is not the problem. The problem is that it's running but not doing its job, and not reflecting that in any log or metric (I still hope I missed something here :-) )
Ah yes, I had the same problem. You need to restart the pod, e.g.:
kubectl delete pods image-automation-controller-64c955c59-ckbft -n flux-system
Sure @jwerre, I know. Thanks :-) As I mentioned earlier, sometimes restarting image-automation-controller is not enough and we need to restart the source-controller pod as well.
As a temporary "automatic mitigation", we introduced a scheduled external script that checks .status.lastAutomationRunTime of the ImageUpdateAutomation resource ( https://github.com/fluxcd/image-automation-controller/blob/041018fb00f853263dc7aac5f9dd7a18d058e0f0/config/crd/bases/image.toolkit.fluxcd.io_imageupdateautomations.yaml#L170 ) and invokes kubectl delete pod if the automation seems to have been stuck for too long (a simplified sketch follows below).
We'd love to see this bug fixed soon, anyway :-)
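Roughly, the check looks like this (a hypothetical, simplified sketch; the namespace, object name, threshold, and pod label are illustrative and based on the default Flux layout):

#!/usr/bin/env bash
# Hypothetical watchdog: restart image-automation-controller when the
# ImageUpdateAutomation has not run for longer than THRESHOLD seconds.
set -euo pipefail

NS=flux-system        # namespace (assumption)
NAME=flux-system      # ImageUpdateAutomation name (assumption)
THRESHOLD=1800        # seconds without a run before restarting (illustrative)

last_run=$(kubectl -n "$NS" get imageupdateautomation "$NAME" \
  -o jsonpath='{.status.lastAutomationRunTime}')

# GNU date parses the RFC 3339 timestamp from the status field
last_epoch=$(date -d "$last_run" +%s)
now_epoch=$(date +%s)

if (( now_epoch - last_epoch > THRESHOLD )); then
  echo "No automation run since $last_run; restarting image-automation-controller"
  kubectl -n "$NS" delete pod -l app=image-automation-controller
fi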
@bondido, I misunderstood your question. Sorry for the confusion.
sometime restarting image-automation-controller is not enough and we need to restart source-controller pod as well.
@bondido Do you mean that source-controller has the same problem; or, that to get image-automation-controller to start working again, you needed to restart source-controller?
We have to restart source-controller for image-automation-controller to start working. In fact, we restart both: first image-automation-controller, and if we don't see any improvement within a couple of minutes, source-controller.
So far, we haven't tried restarting just source-controller.
@bondido Thanks for elaborating! On the face of it, I wouldn't expect restarting source-controller to have any effect on image-automation-controller, because it works independently: it only coincidentally refers to the same GitRepository objects, and doesn't alter anything at the upstream git repository (that might "unlock" image-automation-controller). Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?
@bondido @jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
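For example, using the pod name from above (the flux-system namespace is assumed; quote the glob so it expands inside the container rather than on your workstation):

kubectl -n flux-system exec -it image-automation-controller-64c955c59-ckbft -- ls -lah /tmp
kubectl -n flux-system exec -it image-automation-controller-64c955c59-ckbft -- sh -c 'du -sh /tmp/*'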
Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?
I can't be 100% sure, as I couldn't get to any logs or metrics confirming what was actually happening.
In the first two cases, restarting image-automation-controller was enough and new images were applied to the cluster just seconds after the restart. But then, on the third occurrence, we restarted image-automation-controller and nothing happened for over 10 minutes. So it was just a blind shot by my colleague to restart source-controller, after which everything started working.
The situation repeated exactly like the above one more time.
@bondido @jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
You mean checking the directory contents inside the "stuck" controller pod, don't you? (In a "working OK" pod I've checked it and it's empty.)
@jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
ls -lah /tmp
total 0
drwxrwsrwx 2 root 1337 6 Jan 6 15:22 .
drwxr-xr-x 1 root root 17 Jan 4 16:20 ..
du -sh /tmp/*
du: cannot access '/tmp/AlTest1.err': No such file or directory
du: cannot access '/tmp/AlTest1.out': No such file or directory
du: cannot access '/tmp/adobegc.log': No such file or directory
du: cannot access '/tmp/com.apple.launchd.yGWoPxlEXv': No such file or directory
du: cannot access '/tmp/com.google.Keystone': No such file or directory
du: cannot access '/tmp/fseventsd-uuid': No such file or directory
du: cannot access '/tmp/powerlog': No such file or directory
command terminated with exit code 1
I should mention that I haven't had any problems since I restarted the controller pod the first time.
Hello,
can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp
In my case, on the stuck controller, /tmp contains a directory named like the GitRepository source of the frozen ImageUpdateAutomation.
And a simple restart of the image-automation-controller is enough to unblock the frozen ImageUpdateAutomation.
I've gone to some lengths to try to reproduce this issue. I ran image-automation-controller with a larger-than-average git repo (stuffed with several mp4 video files), ramped up unfavorable network conditions (packet loss, latency) with Chaos Mesh, and reconfigured the liveness checks so that image-automation-controller wouldn't be restarted for network reasons (which was tricky, because it actually needs the network in order to perform leader election).
With all webhooks configured as receivers for image and git events, to make sure everything happened quickly after each commit/image release, I ran this for several hours with updates every 45 seconds, and I wasn't able to get image-automation-controller into any stuck or hanging state. I was able to make it stop working with heavy packet loss, but nothing I did seemed to induce any sort of hanging behavior. (When the unfavorable conditions abated, the controller always recovered and went back to committing and pushing changes for me.)
If anyone knows what type of network issue or abnormal response from GitHub triggers the condition, I can surely reproduce it and make progress on this issue, but so far I have not made significant progress.
@bondido @jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
After over 16 days since the last problems, the controller has just gotten "stuck" on one of our clusters. The output of the above commands run in the stuck pod is:
total 12K
drwxrwsrwx 3 root 1337 4.0K Jan 20 09:49 .
drwxr-xr-x 1 root root 4.0K Jan 3 09:24 ..
drwx--S--- 6 controller 1337 4.0K Jan 20 09:49 flux-system-flux-system56439449
3.0M /tmp
Another example: on a cluster with 16 different ImageUpdateAutomation resources, I have 3 of them "stuck".
controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ ls -alh
total 20K
drwxrwsrwx 5 root 1337 4.0K Jan 21 12:37 .
drwxr-xr-x 1 root root 4.0K Jan 17 09:21 ..
drwx--S--- 3 controller 1337 4.0K Jan 20 08:55 xxxx-xxxx-flux1428284464
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 xxxx-xxxx-flux3351794846
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 yyyy-yyyy-flux2915045790
controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ du -sh /tmp/*
2.0M /tmp/xxxx-xxxx-flux1428284464
696K /tmp/xxxx-xxxx-flux3351794846
3.5M /tmp/yyyy-yyyy-flux2915045790
Could the image from #297 resolve this issue?
Because I sometimes have errors like Unable to clone: Connection timed out
or Unable to clone: failed to start SSH session: Unable to exchange encryption keys
@ahisette yes the libgit2 timeout callback could be the reason, please try out the image from #297 and see if it problem goes away.
Hello, I put the new image on 2 of 4 clusters. After one week of testing, I have no good news: one freeze on a cluster with the new image, and also one freeze with an "old" image, with the same symptoms.
nobody@image-automation-controller-84bfbc78dc-89bqv:/$ ls -alh /tmp
total 12K
drwxrwsrwx 3 root 1337 4.0K Feb 1 08:54 .
drwxr-xr-x 1 root root 4.0K Jan 24 13:25 ..
drwx--S--- 8 nobody 1337 4.0K Feb 1 07:13 xxxx-xxxx-flux2375737287
nobody@image-automation-controller-84bfbc78dc-89bqv:/$ du -sh /tmp/*
5.9M /tmp/xxxx-xxxx-flux2375737287
With the release of Flux v0.26.2, we would like to kindly ask folks with issues to update to the latest image releases. Since we changed our build process around libgit2 for the source-controller and image-automation-controller, we have observed some of the issues described here vanish (as confirmed by others in https://github.com/fluxcd/source-controller/issues/439#issuecomment-1030812206).
@hiddeco I installed the new Flux yesterday and pushed an image today; it seems to work for me now as well.
I managed to reproduce this locally. I am running against the latest IAC version v0.20.1, with a single SSH GitHub repository using libgit2, with 1-minute intervals and the same for timeouts.
Just to help further investigation, I will relay some of my observations/assumptions here as I progress.
By analysing the pprof endpoint, I noticed that the time difference in minutes between the last IAC reconciliation log message seems to match the running time of the thread/goroutine below, leading me to think that libgit2's git_remote_fetch is hanging. This was previously reported upstream and fixed a few years back; however, the issues may be correlated.
goroutine 332 [syscall, 4209 minutes, locked to thread]:
github.com/libgit2/git2go/v33._Cfunc_git_remote_fetch(0x7f37efdf97c0, 0xc0006e9290, 0xc00073cd00, 0x0)
_cgo_gotypes.go:6954 +0x4c
github.com/libgit2/git2go/v33.(*Remote).Fetch.func2(0xc00073cd00, 0xc00010fe60, 0x40e414, 0xab)
github.com/libgit2/git2go/v33@v33.0.7/remote.go:1044 +0xa7
github.com/libgit2/git2go/v33.(*Remote).Fetch(0xc00056e720, {0xc000f98ec0, 0xc001331e60, 0x1}, 0x2, {0x0, 0x50b8a6})
github.com/libgit2/git2go/v33@v33.0.7/remote.go:1044 +0x1e9
github.com/fluxcd/image-automation-controller/controllers.fetch({0x1fc1c40, 0xc00056e600}, {0xc0013c6690, 0x25}, {0xc000aac370, 0x4}, {0xc001331e60, {0xc000dbef80, 0xc000676200}})
github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:743 +0x32e
github.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile(0xc0003f2680, {0x1fc1c78, 0xc0012cad80}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x30}}})
github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:270 +0x1b15
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00010c000, {0x1fc1c78, 0xc0012cacc0}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x414fb4}}})
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00010c000, {0x1fc1bd0, 0xc0003f2540}, {0x1bb1ea0, 0xc0011343a0})
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00010c000, {0x1fc1bd0, 0xc0003f2540})
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357
The container is still operational (I can exec into it) and other goroutines seem to be working as expected.
For an automatic restart, users could leverage the workqueue_longest_running_processor_seconds metric whilst we try to get to the bottom of the issue. Here's a dump of some of the metrics for my hung IAC:
workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 256535.347854138
# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
# TYPE workqueue_queue_duration_seconds histogram
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-08"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-07"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-06"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-06"} 4
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-05"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.001"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.01"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.1"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1"} 84
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="10"} 86
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="+Inf"} 91
workqueue_queue_duration_seconds_sum{name="imageupdateautomation"} 1000.9602119180001
workqueue_queue_duration_seconds_count{name="imageupdateautomation"} 91
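As an illustration, a crude watchdog on that metric could look like the sketch below (hypothetical; the port assumes the default --metrics-addr=:8080 from the Flux manifests, and the threshold is arbitrary):

#!/usr/bin/env bash
# Hypothetical watchdog: restart image-automation-controller when a single
# reconcile has been running for longer than THRESHOLD seconds.

NS=flux-system
THRESHOLD=3600   # arbitrary; pick something well above your interval and timeout

# Reach the controller's Prometheus metrics endpoint via a port-forward.
kubectl -n "$NS" port-forward deploy/image-automation-controller 8080:8080 &
PF_PID=$!
sleep 2

longest=$(curl -s http://localhost:8080/metrics \
  | grep -F 'workqueue_longest_running_processor_seconds{name="imageupdateautomation"}' \
  | awk '{print int($2)}')

kill "$PF_PID"

if [ "${longest:-0}" -gt "$THRESHOLD" ]; then
  echo "Reconcile has been running for ${longest}s; restarting image-automation-controller"
  kubectl -n "$NS" rollout restart deployment/image-automation-controller
fi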
The image-automation-controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.
The experimental transport needs to be opted into by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller's Deployment. Once this feature has been tested extensively, it may later become enabled by default.
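For a quick test, the variable can be set with a command like the one below (illustrative; note that on bootstrapped clusters the flux-system Kustomization will revert ad-hoc changes on its next reconciliation, so a patch committed to your repository is the durable option):

kubectl -n flux-system set env deployment/image-automation-controller \
  EXPERIMENTAL_GIT_TRANSPORT=true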
Due to changes in other Flux components, it is recommended that all components are deployed at their latest versions. The recommended approach is via flux bootstrap using flux CLI version v0.28.0, which will be released tomorrow.
It would be great if users experiencing this issue could test it again with the experimental transport enabled and let us know whether the issue persists.
Hi @pjbgf, thank you for the update. I deployed v0.21.0 as you were posting your message, and unfortunately the issue seems to persist.
Sample metrics:
workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 3302.405028565
workqueue_queue_duration_seconds_bucket{le="+Inf", name="imageupdateautomation"} 3
EXPERIMENTAL_GIT_TRANSPORT is set:
$ kubectl --namespace=flux-system exec -ti image-automation-controller-7995f48c77-g99qd -- \
printenv EXPERIMENTAL_GIT_TRANSPORT
true
Version:
$ kubectl --namespace=flux-system get pod image-automation-controller-7995f48c77-g99qd \
--output=jsonpath='{.spec.containers[?(@.name=="manager")].image}'
ghcr.io/fluxcd/image-automation-controller:v0.21.0
Nothing specific in the logs. Our interval is 1m and it does commit on restarts.
The latest working version for us is still v0.19.0.
@maxbrunet thank you for the quick response. Would you be able to collect a profile and share it either here or on Slack, please?
Here is the output of /debug/pprof/goroutine?debug=2: https://gist.github.com/maxbrunet/8e304510d506e83be6ea2bb8a78ab42b
@maxbrunet thank you again for testing and providing the details so promptly. We have since fixed a few more issues in the new transport and released a new minor version. Would you be able to give it another try and confirm whether that fixes the problems you were experiencing?
Here's more information on how to test: https://github.com/fluxcd/source-controller/issues/636#issuecomment-1080789920
Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup.
FYI, I had not realized I needed to change spec.gitImplementation to libgit2 before.
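For anyone else making the same switch, the field can be changed with something like the following (an illustrative patch; the GitRepository name and namespace assume the default bootstrap layout, and the change should also be committed to your repository to persist):

kubectl -n flux-system patch gitrepository flux-system --type=merge \
  -p '{"spec":{"gitImplementation":"libgit2"}}'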
@maxbrunet we have made some improvements that may fix the issue you are experiencing. I have a release candidate for source-controller (below) that is based on a pending PR.
Can you please test the image below and let us know whether that fixes your problem?
ghcr.io/fluxcd/source-controller:rc-6d517589
Hey @pjbgf, no, sorry, I used Flux with my previous employer, and I am not working with it at the moment
@maxbrunet no worries, thank you for all the help so far.
This should be fixed as part of the managed transport improvements and the enforcement of context timeouts. Release candidates for both controllers:
ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59
Closing this for lack of activity. Similarly reported issues have been confirmed to be fixed.
Now with Managed Transport enforcing timeouts for Git operations, this should be resolved.
If it reoccurs, given the sheer number of changes to the Git implementation in the last 6 months, we are better off creating a new issue that links back to this one.
Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup (image-automation-controller panic trace, source-controller panic trace, GitRepository + Secret attached). FYI, I had not realized I needed to change spec.gitImplementation to libgit2 before.
@maxbrunet how do you obtain these panic traces? I've port-forwarded the pprof endpoints and can get debug info on request, but how do you get information when the process panics?
Stacktraces are dumped in the logs when the process panics; you can get the logs from before the last restart with kubectl logs --previous POD_NAME
Stacktraces are dumped in the logs when the process panics; you can get the logs from before the last restart with kubectl logs --previous POD_NAME
Ah, ok, thank you.
Reported here: https://github.com/fluxcd/flux2/discussions/2219
Having an automation that should reconcile every 7 minutes:
The reconciliation stopped two days ago for unknown reasons: