Open pmatheson-greenphire opened 3 years ago
The current workaround is to disable leader election by setting --enable-leader-election=false
in all controllers.
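For anyone looking for the concrete change, a minimal sketch of how that flag could be applied declaratively, assuming the standard flux-system bootstrap layout (only the `--enable-leader-election=false` flag itself comes from this thread; the file layout and patch mechanics are the usual kustomize approach):

```yaml
# flux-system/kustomization.yaml (assumed standard bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      # Append the flag to the controller's args; container index 0 assumes
      # the manager container is first, as in the stock manifests.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --enable-leader-election=false
    target:
      kind: Deployment
      name: source-controller   # widen or repeat the target to cover all controllers
```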
@pmatheson-greenphire can you please try the image from #318? Thanks
I have set replicas to 2 for the source-controller and now the 2nd pod never becomes ready. In this redundant setup I cannot disable leader election. The controller version is v0.16.1.
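For reference, the setup being described amounts to a patch like the one below; everything except `replicas: 2` and leader election staying enabled is an assumption about the layout:

```yaml
# Strategic merge patch: two source-controller replicas, leader election
# left at its default (enabled), so only the elected leader reports Ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  replicas: 2
```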
The Helm code has received extensive updates in the recent releases, and this may have helped alleviate this issue: one of the problems addressed was a tricky-to-reproduce memory leak that likely also caused performance slowdowns.
Can you say whether this issue is still present in Flux 0.23.0?
I don't know if source-controller with multiple replicas will actually work; I think that only the leader is effectively working and the others are "for backup", in case the leader disappears. You can generally increase the --concurrency parameter on Flux controllers, but this comes with a performance impact and likely also requires an increase in memory limits. I'm interested in any progress you have made with this issue in the meantime.
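As a sketch of that vertical-scaling route (note the flag appears as `--concurrent` in the controller arguments of recent versions; the values below are placeholders to be sized from your own monitoring, not recommendations):

```yaml
patches:
  - patch: |
      # Raise reconcile parallelism and the memory limit together; the
      # numbers here are placeholders, not recommendations.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=10
      # The replace op assumes the stock manifest already defines a memory limit.
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 2Gi
    target:
      kind: Deployment
      name: source-controller
```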
I tried with Flux 0.23.0 and cannot reproduce the issue.
Ok, now I see the readiness probes are still failing, so this issue is not fixed.
I don't think this is an issue; rather, it is by design. Since only one instance of source-controller will actively be reconciling (the leader), any other instances will by nature have to be un-ready. They are not (or should not be?) restarted continuously, since they are alive and waiting to be elected leader, but they are not Ready, so traffic will not be routed to them.
Perhaps I spoke too soon: in source-controller, the readiness and liveness probes are the same.
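For context, the shape of the problem looks roughly like this; the paths and ports below are illustrative, not copied from the actual source-controller manifest:

```yaml
# Illustrative only: when both probes hit the same endpoint, anything that
# makes the pod unready also makes it a restart candidate.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz
readinessProbe:
  httpGet:
    path: /healthz
    port: healthz
```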
Do your idle source-controller pods get killed off when the liveness probe fails? That is probably not by design.
Are you sure that running multiple source-controller pods is going to get you the desired result? I'm not certain it has been made clear enough what leader election limits you to, and what you can expect to gain from an increased number of source-controller replicas.
There is no "HA mode" or "scaled up" source controller when it comes to replica count. The source controller can at present only be scaled vertically, by increasing --concurrency and adding more RAM and CPU as needed, as determined by performance monitoring. I'm not sure that's where you'll find a bottleneck, though, unless the source controller is restarted more often than it is designed to be.
(Edit: I am overstating the rigidity of the design. As was explained before, you can also disable leader election, and then all the replicas should become ready. You haven't stated why you do not want to disable leader election, but that is the apparent reason why you have this issue @Legion2.)
I scale the source-controller horizontally to make it highly available in case of node failure. So leader election must be turned on. From my understanding, it is totally fine that only the leader is actively reconciling resources and the other replica is waiting to become the leader.
Totally fine, maybe. But I question whether this configuration is accurately described as HA.
Recovering from a source-controller restart may require more specialized management than simply scaling up.
There is an emptyDir volume attached by default, and it will not be replicated between source-controller replicas, so its contents have to be regenerated when a new leader comes online. That could be considered a period of unavailability, even if the source-controller service is technically operating. I don't think I would describe a scaled-up source controller as HA.
Another important use case is that during a rolling deployment upgrade there is always one ready replica which can apply changes from a source. For example, someone pushes a broken Flux configuration to Git, and the source-controller and kustomize-controller apply this configuration; it causes the source-controller deployment to upgrade and get stuck because the new pod cannot be scheduled. In this situation a fix to the Flux configuration in Git is never applied to the cluster, because there is no ready source-controller pod which can download the source. (source-controller uses the Recreate strategy.)
We are also facing this issue with version 0.29.5 :(
Hey there,
I think I'm having a similar issue with my source-controller. I'm not that experienced in Kubernetes and Flux yet, but I'm also getting a `Readiness probe failed: Get ... connect: connection refused` error and I can't find the source of the problem. After a while the source-controller goes into a `CrashLoopBackOff` state. I can get rid of the issue temporarily using the params for the readinessProbe mentioned by @pmatheson-greenphire:
```yaml
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 10
```
But after a while the issue returns. I'm not sure when it started, but I don't think I had the issue when I first started using Flux. The Flux version is 0.36.0.
So far, none of the issues describing similar errors have helped; I can't figure out what the cause is.
I'm thankful for any pointers!
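For anyone wanting to apply those probe timings declaratively rather than by hand, a sketch (the container name and everything other than the three values quoted above are assumptions about the stock manifest):

```yaml
# Strategic merge patch relaxing the source-controller readiness probe.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager   # container name assumed from the stock manifest
          readinessProbe:
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 10
```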
I found myself in a similar situation. For those who really need to fix this, I found this hacky way.
Edit the source-controller deployment with `kubectl edit -n flux-system deployments.apps source-controller` and remove the `readinessProbe` section from the manifest. Save and exit. If necessary, delete the source-controller pod so it is recreated without the readiness probe.
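A note on that workaround: in a bootstrapped setup, a manual `kubectl edit` will be reverted the next time the flux-system manifests are reconciled or upgraded. If you really want to keep the change, a kustomize patch like the sketch below is more durable (bearing in mind that removing the probe trades away readiness signaling entirely):

```yaml
patches:
  - patch: |
      # Drop the readiness probe; the pod will count as Ready
      # as soon as its containers start.
      - op: remove
        path: /spec/template/spec/containers/0/readinessProbe
    target:
      kind: Deployment
      name: source-controller
```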
Maybe this helps others: I had exactly the same problem. The root cause for me was an incomplete no_proxy configuration. According to https://fluxcd.io/flux/cheatsheets/bootstrap/#using-https-proxy-for-egress-traffic, the no_proxy variable must also contain the IP address range of the pods.
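For completeness, the relevant part of such a patch could look like the sketch below; the proxy address and CIDRs are placeholders that must be replaced with your own proxy and pod/service ranges, as the linked cheatsheet describes:

```yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: source-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: HTTPS_PROXY
                    value: "http://proxy.example.com:3128"   # placeholder
                  - name: NO_PROXY
                    value: ".cluster.local.,.cluster.local,.svc,10.0.0.0/8"   # include your pod/service CIDRs
```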
Describe the bug
With 20-25 Helm releases to reconcile, the source-controller readinessProbe starts to fail and the pod enters a CrashLoopBackOff state.
To Reproduce
Steps to reproduce the behaviour:
Make the source-controller busy by giving it 20+ Helm releases to reconcile. I'm sure the chart and the location of the repositories matter, because some take longer than others.
Expected behavior
A running source-controller shouldn't fail its readinessProbe.
Additional context
I was able to fix the issue by setting
Below please provide the output of the following commands:
other logs available upon request