Open pmatheson-greenphire opened 3 years ago
The current workaround is to disable leader election by setting --enable-leader-election=false
in all controllers.
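For anyone looking for the concrete change, a minimal sketch of how that flag could be applied declaratively, assuming the standard flux-system bootstrap layout (only the `--enable-leader-election=false` flag itself comes from this thread; the file layout and patch mechanics are the usual kustomize approach):

```yaml
# flux-system/kustomization.yaml (assumed standard bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      # Append the flag to the controller's args; container index 0 assumes
      # the manager container is first, as in the stock manifests.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --enable-leader-election=false
    target:
      kind: Deployment
      name: source-controller   # widen or repeat the target to cover all controllers
```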
@pmatheson-greenphire can you please try the image from #318? Thanks
I have set replicas to 2 for the source-controller and now the 2nd pod never becomes ready. In this redundant setup I cannot disable leader election. The controller version is v0.16.1.
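For reference, the setup being described amounts to a patch like the one below; everything except `replicas: 2` and leader election staying enabled is an assumption about the layout:

```yaml
# Strategic merge patch: two source-controller replicas, leader election
# left at its default (enabled), so only the elected leader reports Ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  replicas: 2
```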
The Helm code has received extensive updates in the recent releases, and this may have helped alleviate this issue: one of the problems addressed was a tricky-to-reproduce memory leak that likely also caused performance slowdowns.
Can you say whether this issue is still present in Flux 0.23.0?
I don't know if source-controller with multiple replicas will actually work; I think that only the leader is effectively working and the others are "for backup", in case the leader disappears. You can generally increase the --concurrency parameter on Flux controllers, but this comes with a performance impact and likely also requires an increase in memory limits. I'm interested in any progress you have made with this issue in the meantime.
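As a sketch of that vertical-scaling route (note the flag appears as `--concurrent` in the controller arguments of recent versions; the values below are placeholders to be sized from your own monitoring, not recommendations):

```yaml
patches:
  - patch: |
      # Raise reconcile parallelism and the memory limit together; the
      # numbers here are placeholders, not recommendations.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=10
      # The replace op assumes the stock manifest already defines a memory limit.
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 2Gi
    target:
      kind: Deployment
      name: source-controller
```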
I tried with Flux 0.23.0 and cannot reproduce the issue.
Ok, now I see the readiness probes are still failing, so this issue is not fixed.
I don't think this is an issue; rather, it is by design. Since only one instance of source-controller will actively be reconciling (the leader), any other instances will by nature have to be un-ready. They are not (or should not be?) restarted continuously, since they are alive and waiting to be elected leader, but they are not Ready, so traffic will not be routed to them.
Perhaps I spoke too soon: in source-controller, the readiness and liveness probes are the same.
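For context, the shape of the problem looks roughly like this; the paths and ports below are illustrative, not copied from the actual source-controller manifest:

```yaml
# Illustrative only: when both probes hit the same endpoint, anything that
# makes the pod unready also makes it a restart candidate.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz
readinessProbe:
  httpGet:
    path: /healthz
    port: healthz
```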
Do your idle source-controller pods get killed off when the liveness probe fails? That is probably not by design.
Are you sure that running multiple source-controller pods is going to get you the desired result? I'm not certain it has been made clear enough what leader election limits you to, and what you can expect to gain from an increased number of source-controller replicas.
There is no "HA mode" or "scaled up" source controller when it comes to replica count. The source controller can at present only be scaled vertically, by increasing --concurrency and adding more RAM and CPU as needed, as determined by performance monitoring. I'm not sure that's where you'll find a bottleneck, though, unless the source controller is restarted more often than it is designed to be.
(Edit: I am overstating the rigidity of the design. As was explained before, you can also disable leader election, and then all the replicas should become ready. You haven't stated why you do not want to disable leader election, but that is the apparent reason why you have this issue @Legion2.)
I scale the source-controller horizontally to make it highly available in case of node failure. So leader election must be turned on. From my understanding, it is totally fine that only the leader is actively reconciling resources and the other replica is waiting to become the leader.
Totally fine, maybe. But I question whether this configuration is accurately described as HA.
Recovering from a source-controller restart may require more specialized management than simply scaling up.
There is an emptyDir volume attached by default, and it will not be replicated between source-controller replicas, so its contents have to be regenerated when a new leader comes online. That could be considered a period of unavailability, even if the source-controller service is technically operating. I don't think I would describe a scaled-up source controller as HA.
Another important use case is that during a rolling deployment upgrade there is always one ready replica which can apply changes from a source. For example, someone pushes a broken Flux configuration to Git, and the source-controller and kustomize-controller apply this configuration; it causes the source-controller deployment to upgrade and get stuck because the new pod cannot be scheduled. In this situation a fix to the Flux configuration in Git is never applied to the cluster, because there is no ready source-controller pod which can download the source. (source-controller uses the Recreate strategy.)
We are also facing this issue with version 0.29.5 :(
Hey there,
I think I'm having a similar issue with my source-controller. I'm not that experienced in Kubernetes and Flux yet, but I'm also getting a `Readiness probe failed: Get ... connect: connection refused` error and I can't find the source of the problem. After a while the source-controller goes into a `CrashLoopBackOff` state. I can get rid of the issue temporarily using the params for the readinessProbe mentioned by @pmatheson-greenphire:
```yaml
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 10
```
But after a while the issue returns. I'm not sure when it started, but I don't think I had the issue when I first started using Flux. The Flux version is 0.36.0.
So far, none of the issues describing similar errors have helped; I can't figure out what the cause is.
I'm thankful for any pointers!
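For anyone wanting to apply those probe timings declaratively rather than by hand, a sketch (the container name and everything other than the three values quoted above are assumptions about the stock manifest):

```yaml
# Strategic merge patch relaxing the source-controller readiness probe.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager   # container name assumed from the stock manifest
          readinessProbe:
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 10
```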
I found myself in a similar situation. For those who really need to fix this, I found this hacky way.
Edit the source-controller deployment with `kubectl edit -n flux-system deployments.apps source-controller` and remove the `readinessProbe` section from the manifest. Save and exit. If necessary, delete the source-controller pod so it is recreated without the readiness probe.
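A note on that workaround: in a bootstrapped setup, a manual `kubectl edit` will be reverted the next time the flux-system manifests are reconciled or upgraded. If you really want to keep the change, a kustomize patch like the sketch below is more durable (bearing in mind that removing the probe trades away readiness signaling entirely):

```yaml
patches:
  - patch: |
      # Drop the readiness probe; the pod will count as Ready
      # as soon as its containers start.
      - op: remove
        path: /spec/template/spec/containers/0/readinessProbe
    target:
      kind: Deployment
      name: source-controller
```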
Maybe this helps others: I had exactly the same problem. The root cause for me was an incomplete no_proxy configuration. According to https://fluxcd.io/flux/cheatsheets/bootstrap/#using-https-proxy-for-egress-traffic, the no_proxy variable must also contain the IP address range of the pods.
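For completeness, the relevant part of such a patch could look like the sketch below; the proxy address and CIDRs are placeholders that must be replaced with your own proxy and pod/service ranges, as the linked cheatsheet describes:

```yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: source-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: HTTPS_PROXY
                    value: "http://proxy.example.com:3128"   # placeholder
                  - name: NO_PROXY
                    value: ".cluster.local.,.cluster.local,.svc,10.0.0.0/8"   # include your pod/service CIDRs
```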
Describe the bug
With 20-25 Helm releases to reconcile, the source-controller readinessProbe starts to fail and the pod enters a CrashLoopBackOff state.
To Reproduce
Steps to reproduce the behaviour:
Make the source-controller busy by giving it 20+ Helm releases to reconcile. I'm sure the chart and the location of the repositories matter, because some take longer than others.
Expected behavior
A running source-controller shouldn't fail its readinessProbe.
Additional context
I was able to fix the issue by setting
Below please provide the output of the following commands:
other logs available upon request