stefanprodan closed this issue 2 years ago.
This coincided with outages at AWS in us-west-2, where I'm running Kubernetes on EKS. My entire cluster went down for about 10 minutes that day, and when it came back up I ran into this issue.
Hi, while waiting for a permanent fix, is there any advice on how to detect the "stuck" image-automation-controller case? Are there any metrics or logs we should monitor? We have been suffering from this issue (roughly a couple of times per week) for some time, and the only alert comes from users who pushed their images to the container registry and didn't see a cluster deployment for quite some time.
Another observation is that in most cases restarting image-automation-controller is sufficient, but there were two times when we also needed to restart source-controller.
@bondido how about something like this:
kubectl get pod -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-96dc99bfd-p9g4q 1/1 Running 0 9d
image-automation-controller-64c955c59-ckbft 1/1 Running 0 9d <<< ------
image-reflector-controller-55fb7f765d-cr8zn 1/1 Running 0 9d
kustomize-controller-7bc878f768-xf2xb 1/1 Running 0 9d
notification-controller-945795558-q8nht 1/1 Running 0 9d
source-controller-65665fd68f-n7qqz 1/1 Running 0 9d
@jwerre I'm afraid it won't help, as a long-running controller is not the problem. The problem is that it's running but not doing its job, and not reflecting that in any log or metric (I still hope I missed something here :-) )
Ah yes, I had the same problem. You need to restart the pod, e.g.:
kubectl delete pods image-automation-controller-64c955c59-ckbft -n flux-system
Sure @jwerre, I know. Thanks :-) As I mentioned earlier, sometimes restarting image-automation-controller is not enough and we need to restart the source-controller pod as well.
As a temporary "automatic mitigation", we introduced a scheduled external script that checks .status.lastAutomationRunTime of the ImageUpdateAutomation resource ( https://github.com/fluxcd/image-automation-controller/blob/041018fb00f853263dc7aac5f9dd7a18d058e0f0/config/crd/bases/image.toolkit.fluxcd.io_imageupdateautomations.yaml#L170 ) and invokes kubectl delete pod if the automation seems to have been stuck for too long (a simplified sketch follows below).
We'd love to see this bug fixed soon, anyway :-)
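Roughly, the check looks like this (a hypothetical, simplified sketch; the namespace, object name, threshold, and pod label are illustrative and based on the default Flux layout):

#!/usr/bin/env bash
# Hypothetical watchdog: restart image-automation-controller when the
# ImageUpdateAutomation has not run for longer than THRESHOLD seconds.
set -euo pipefail

NS=flux-system        # namespace (assumption)
NAME=flux-system      # ImageUpdateAutomation name (assumption)
THRESHOLD=1800        # seconds without a run before restarting (illustrative)

last_run=$(kubectl -n "$NS" get imageupdateautomation "$NAME" \
  -o jsonpath='{.status.lastAutomationRunTime}')

# GNU date parses the RFC 3339 timestamp from the status field
last_epoch=$(date -d "$last_run" +%s)
now_epoch=$(date +%s)

if (( now_epoch - last_epoch > THRESHOLD )); then
  echo "No automation run since $last_run; restarting image-automation-controller"
  kubectl -n "$NS" delete pod -l app=image-automation-controller
fi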
@bondido, I misunderstood your question. Sorry for the confusion.
sometime restarting image-automation-controller is not enough and we need to restart source-controller pod as well.
@bondido Do you mean that source-controller has the same problem; or, that to get image-automation-controller to start working again, you needed to restart source-controller?
We have to restart source-controller for image-automation-controller to start working. In fact, we restart both: first image-automation-controller, and if we don't see any improvement within a couple of minutes, source-controller.
So far, we haven't tried restarting just source-controller.
@bondido Thanks for elaborating! On the face of it, I wouldn't expect restarting source-controller to have any effect on image-automation-controller, because it works independently: it only coincidentally refers to the same GitRepository objects, and doesn't alter anything at the upstream git repository (that might "unlock" image-automation-controller). Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?
@bondido @jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
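For example, using the pod name from above (the flux-system namespace is assumed; quote the glob so it expands inside the container rather than on your workstation):

kubectl -n flux-system exec -it image-automation-controller-64c955c59-ckbft -- ls -lah /tmp
kubectl -n flux-system exec -it image-automation-controller-64c955c59-ckbft -- sh -c 'du -sh /tmp/*'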
Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?
I can't be 100% sure, as I couldn't get to any logs or metrics confirming what was actually happening.
In the first two cases, restarting image-automation-controller was enough and new images were applied to the cluster just seconds after the restart. But then, on the third occurrence, we restarted image-automation-controller and nothing happened for over 10 minutes. So it was just a blind shot by my colleague to restart source-controller, after which everything started working.
The situation repeated exactly like the above one more time.
@bondido @jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
You mean checking the directory contents inside the "stuck" controller pod, don't you? (In a "working OK" pod I've checked it and it's empty.)
@jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
ls -lah /tmp
total 0
drwxrwsrwx 2 root 1337 6 Jan 6 15:22 .
drwxr-xr-x 1 root root 17 Jan 4 16:20 ..
du -sh /tmp/*
du: cannot access '/tmp/AlTest1.err': No such file or directory
du: cannot access '/tmp/AlTest1.out': No such file or directory
du: cannot access '/tmp/adobegc.log': No such file or directory
du: cannot access '/tmp/com.apple.launchd.yGWoPxlEXv': No such file or directory
du: cannot access '/tmp/com.google.Keystone': No such file or directory
du: cannot access '/tmp/fseventsd-uuid': No such file or directory
du: cannot access '/tmp/powerlog': No such file or directory
command terminated with exit code 1
I should mention that I haven't had any problems since I restarted the controller pod the first time.
Hello,
can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp
In my case, on the stuck controller, /tmp contains a directory named like the GitRepository source of the frozen ImageUpdateAutomation.
And a simple restart of the image-automation-controller is enough to unblock the frozen ImageUpdateAutomation.
I've gone to some lengths to try to reproduce this issue. I ran image-automation-controller with a larger-than-average git repo (stuffed with several mp4 video files), ramped up unfavorable network conditions (packet loss, latency) with Chaos Mesh, and reconfigured the liveness checks so that image-automation-controller wouldn't be restarted for network reasons (which was tricky, because it actually needs the network in order to perform leader election).
With all webhooks configured as receivers for image and git events, to make sure everything happened quickly after each commit/image release, I ran this for several hours with updates every 45 seconds, and I wasn't able to get image-automation-controller into any stuck or hanging state. I was able to make it stop working with heavy packet loss, but nothing I did seemed to induce any sort of hanging behavior. (When the unfavorable conditions abated, the controller always recovered and went back to committing and pushing changes for me.)
If anyone knows what type of network issue or abnormal response from GitHub triggers the condition, I can surely reproduce it and make progress on this issue, but so far I have not made significant progress.
@bondido @jwerre can you please exec into the controller pod and see if there is anything left in /tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!
After over 16 days since the last problems, the controller has just gotten "stuck" on one of our clusters. The output of the above commands run in the stuck pod is:
total 12K
drwxrwsrwx 3 root 1337 4.0K Jan 20 09:49 .
drwxr-xr-x 1 root root 4.0K Jan 3 09:24 ..
drwx--S--- 6 controller 1337 4.0K Jan 20 09:49 flux-system-flux-system56439449
3.0M /tmp
Another example: on a cluster with 16 different ImageUpdateAutomation resources, I have 3 of them "stuck".
controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ ls -alh
total 20K
drwxrwsrwx 5 root 1337 4.0K Jan 21 12:37 .
drwxr-xr-x 1 root root 4.0K Jan 17 09:21 ..
drwx--S--- 3 controller 1337 4.0K Jan 20 08:55 xxxx-xxxx-flux1428284464
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 xxxx-xxxx-flux3351794846
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 yyyy-yyyy-flux2915045790
controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ du -sh /tmp/*
2.0M /tmp/xxxx-xxxx-flux1428284464
696K /tmp/xxxx-xxxx-flux3351794846
3.5M /tmp/yyyy-yyyy-flux2915045790
Could the image from #297 resolve this issue?
Because I sometimes have errors like Unable to clone: Connection timed out
or Unable to clone: failed to start SSH session: Unable to exchange encryption keys
@ahisette yes the libgit2 timeout callback could be the reason, please try out the image from #297 and see if it problem goes away.
Hello, I put the new image on 2 of 4 clusters. After one week of testing, I have no good news: one freeze on a cluster with the new image, and also one freeze with an "old" image, with the same symptoms.
nobody@image-automation-controller-84bfbc78dc-89bqv:/$ ls -alh /tmp
total 12K
drwxrwsrwx 3 root 1337 4.0K Feb 1 08:54 .
drwxr-xr-x 1 root root 4.0K Jan 24 13:25 ..
drwx--S--- 8 nobody 1337 4.0K Feb 1 07:13 xxxx-xxxx-flux2375737287
nobody@image-automation-controller-84bfbc78dc-89bqv:/$ du -sh /tmp/*
5.9M /tmp/xxxx-xxxx-flux2375737287
With the release of Flux v0.26.2, we would like to kindly ask folks with issues to update to the latest image releases. Since we changed our build process around libgit2 for the source-controller and image-automation-controller, we have observed some of the issues described here vanish (as confirmed by others in https://github.com/fluxcd/source-controller/issues/439#issuecomment-1030812206).
@hiddeco I installed the new Flux yesterday and pushed an image today; it seems to work for me now as well.
I managed to reproduce this locally. I am running against the latest IAC version v0.20.1, with a single SSH GitHub repository using libgit2, with 1-minute intervals and the same for timeouts.
Just to help further investigation, I will relay some of my observations/assumptions here as I progress.
By analysing the pprof endpoint, I noticed that the time difference in minutes between the last IAC reconciliation log message seems to match the running time of the thread/goroutine below, leading me to think that libgit2's git_remote_fetch is hanging. This was previously reported upstream and fixed a few years back; however, the issues may be correlated.
goroutine 332 [syscall, 4209 minutes, locked to thread]:
github.com/libgit2/git2go/v33._Cfunc_git_remote_fetch(0x7f37efdf97c0, 0xc0006e9290, 0xc00073cd00, 0x0)
_cgo_gotypes.go:6954 +0x4c
github.com/libgit2/git2go/v33.(*Remote).Fetch.func2(0xc00073cd00, 0xc00010fe60, 0x40e414, 0xab)
github.com/libgit2/git2go/v33@v33.0.7/remote.go:1044 +0xa7
github.com/libgit2/git2go/v33.(*Remote).Fetch(0xc00056e720, {0xc000f98ec0, 0xc001331e60, 0x1}, 0x2, {0x0, 0x50b8a6})
github.com/libgit2/git2go/v33@v33.0.7/remote.go:1044 +0x1e9
github.com/fluxcd/image-automation-controller/controllers.fetch({0x1fc1c40, 0xc00056e600}, {0xc0013c6690, 0x25}, {0xc000aac370, 0x4}, {0xc001331e60, {0xc000dbef80, 0xc000676200}})
github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:743 +0x32e
github.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile(0xc0003f2680, {0x1fc1c78, 0xc0012cad80}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x30}}})
github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:270 +0x1b15
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00010c000, {0x1fc1c78, 0xc0012cacc0}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x414fb4}}})
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00010c000, {0x1fc1bd0, 0xc0003f2540}, {0x1bb1ea0, 0xc0011343a0})
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00010c000, {0x1fc1bd0, 0xc0003f2540})
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357
The container is still operational (I can exec into it) and other goroutines seem to be working as expected.
For an automatic restart, users could leverage the workqueue_longest_running_processor_seconds metric whilst we try to get to the bottom of the issue. Here's a dump of some of the metrics for my hung IAC:
workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 256535.347854138
# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
# TYPE workqueue_queue_duration_seconds histogram
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-08"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-07"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-06"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-06"} 4
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-05"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.001"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.01"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.1"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1"} 84
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="10"} 86
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="+Inf"} 91
workqueue_queue_duration_seconds_sum{name="imageupdateautomation"} 1000.9602119180001
workqueue_queue_duration_seconds_count{name="imageupdateautomation"} 91
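As an illustration, a crude watchdog on that metric could look like the sketch below (hypothetical; the port assumes the default --metrics-addr=:8080 from the Flux manifests, and the threshold is arbitrary):

#!/usr/bin/env bash
# Hypothetical watchdog: restart image-automation-controller when a single
# reconcile has been running for longer than THRESHOLD seconds.

NS=flux-system
THRESHOLD=3600   # arbitrary; pick something well above your interval and timeout

# Reach the controller's Prometheus metrics endpoint via a port-forward.
kubectl -n "$NS" port-forward deploy/image-automation-controller 8080:8080 &
PF_PID=$!
sleep 2

longest=$(curl -s http://localhost:8080/metrics \
  | grep -F 'workqueue_longest_running_processor_seconds{name="imageupdateautomation"}' \
  | awk '{print int($2)}')

kill "$PF_PID"

if [ "${longest:-0}" -gt "$THRESHOLD" ]; then
  echo "Reconcile has been running for ${longest}s; restarting image-automation-controller"
  kubectl -n "$NS" rollout restart deployment/image-automation-controller
fi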
The image-automation-controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.
The experimental transport needs to be opted into by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller's Deployment. Once this feature has been tested extensively, it may later become enabled by default.
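For a quick test, the variable can be set with a command like the one below (illustrative; note that on bootstrapped clusters the flux-system Kustomization will revert ad-hoc changes on its next reconciliation, so a patch committed to your repository is the durable option):

kubectl -n flux-system set env deployment/image-automation-controller \
  EXPERIMENTAL_GIT_TRANSPORT=true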
Due to changes in other Flux components, it is recommended that all components are deployed at their latest versions. The recommended approach is via flux bootstrap using flux CLI version v0.28.0, which will be released tomorrow.
It would be great if users experiencing this issue could test it again with the experimental transport enabled and let us know whether the issue persists.
Hi @pjbgf, thank you for the update. I deployed v0.21.0 as you were posting your message, and unfortunately the issue seems to persist.
Sample metrics:
workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 3302.405028565
workqueue_queue_duration_seconds_bucket{le="+Inf", name="imageupdateautomation"} 3
EXPERIMENTAL_GIT_TRANSPORT is set:
$ kubectl --namespace=flux-system exec -ti image-automation-controller-7995f48c77-g99qd -- \
printenv EXPERIMENTAL_GIT_TRANSPORT
true
Version:
$ kubectl --namespace=flux-system get pod image-automation-controller-7995f48c77-g99qd \
--output=jsonpath='{.spec.containers[?(@.name=="manager")].image}'
ghcr.io/fluxcd/image-automation-controller:v0.21.0
Nothing specific in the logs. Our interval is 1m and it does commit on restarts.
The latest working version for us is still v0.19.0.
@maxbrunet thank you for the quick response. Would you be able to collect a profile and share it either here or on Slack, please?
Here is the output of /debug/pprof/goroutine?debug=2: https://gist.github.com/maxbrunet/8e304510d506e83be6ea2bb8a78ab42b
@maxbrunet thank you again for testing and providing the details so promptly. We have since fixed a few more issues in the new transport and released a new minor version. Would you be able to give it another try and confirm whether that fixes the problems you were experiencing?
Here's more information on how to test: https://github.com/fluxcd/source-controller/issues/636#issuecomment-1080789920
Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup.
FYI, I had not realized I needed to change spec.gitImplementation to libgit2 before.
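For anyone else making the same switch, the field can be changed with something like the following (an illustrative patch; the GitRepository name and namespace assume the default bootstrap layout, and the change should also be committed to your repository to persist):

kubectl -n flux-system patch gitrepository flux-system --type=merge \
  -p '{"spec":{"gitImplementation":"libgit2"}}'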
@maxbrunet we have made some improvements that may fix the issue you are experiencing. I have a release candidate for source-controller (below) that is based on a pending PR.
Can you please test the image below and let us know whether that fixes your problem?
ghcr.io/fluxcd/source-controller:rc-6d517589
Hey @pjbgf, no, sorry, I used Flux with my previous employer, and I am not working with it at the moment
@maxbrunet no worries, thank you for all the help so far.
This should be fixed as part of the managed transport improvements and the enforcement of context timeouts. Release candidates for both controllers:
ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59
Closing this for lack of activity. Similarly reported issues have been confirmed to be fixed.
Now with Managed Transport enforcing timeouts for Git operations, this should be resolved.
If it reoccurs, given the sheer number of changes to the Git implementation in the last 6 months, we are better off creating a new issue that links back to this one.
Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup (image-automation-controller panic trace, source-controller panic trace, GitRepository + Secret attached). FYI, I had not realized I needed to change spec.gitImplementation to libgit2 before.
@maxbrunet how do you obtain these panic traces? I've port-forwarded the pprof endpoints and can get debug info on request, but how do you get information when the process panics?
Stacktraces are dumped in the logs when the process panics; you can get the logs from before the last restart with kubectl logs --previous POD_NAME
Stacktraces are dumped in the logs when the process panics; you can get the logs from before the last restart with kubectl logs --previous POD_NAME
Ah, ok, thank you.
Reported here: https://github.com/fluxcd/flux2/discussions/2219
Having an automation that should reconcile every 7 minutes:
The reconciliation stopped two days ago for unknown reasons: