dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

The readinessProbe for traefik pod (1 replica) is too sensitive #390

Open consideRatio opened 3 years ago

consideRatio commented 3 years ago

I have reasoned a lot about readinessProbes and livenessProbes for the JupyterHub Helm chart, and there is one conclusion that I'd like to convey and suggest for this Helm chart as well.

For details about those past experiences, see https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1941.

My conclusion from previous experiences

All k8s Deployment resources that have only a single Pod replica should, if they have a readinessProbe at all, set its failureThreshold very high so that the livenessProbe always triggers before the readinessProbe.
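As a minimal sketch of that guideline (the path, port, and thresholds below are illustrative rather than the chart's actual values), the readinessProbe's failureThreshold is set so high that the livenessProbe restarts the container long before the pod would ever be marked Unready:

```yaml
# Illustrative probe pair for a single-replica Deployment (hypothetical
# values): the livenessProbe restarts the container after ~60s of failed
# pings, long before the readinessProbe could mark the pod Unready.
livenessProbe:
  httpGet:
    path: /ping
    port: 8081            # illustrative port
  periodSeconds: 10
  failureThreshold: 6     # restart after ~60s of failures
readinessProbe:
  httpGet:
    path: /ping
    port: 8081
  periodSeconds: 10
  failureThreshold: 1000  # ~3h of failures before Unready: liveness wins first
```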

The specific situation

I experienced a lot of network issues that I struggled to track down. When I finally did, I noticed it was because the readinessProbe of the traefik pod was flaking from time to time. Whenever that happened, the traefik pod became Unready and stopped receiving traffic from the associated k8s Service. As there was only one replica of the traefik instance, any traffic sent to the traefik k8s Service would fail.

The outcome was intermittent network issues. With a single-replica traefik like I've used (the default), I suggest we increase the readinessProbe failureThreshold to 1000 or similar, or disable it entirely. I'm not confident that the traefik instance supports running with multiple replicas, but if it doesn't, the single-replica default at least makes sense.
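To illustrate the failure mode (with hypothetical names and addresses), this is roughly what the traefik Endpoints object looks like while its only pod is failing the readinessProbe: the address moves to notReadyAddresses, leaving the Service with nothing to route to:

```yaml
# Hypothetical snapshot of the Endpoints object during a probe flake:
apiVersion: v1
kind: Endpoints
metadata:
  name: traefik-dask-gateway   # hypothetical Service name
subsets:
  - notReadyAddresses:         # the only traefik pod, now Unready
      - ip: 10.0.0.12
    ports:
      - port: 8000             # illustrative port
```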

Action point

Clarify whether the api pod, traefik pod, and controller pod are able to run with multiple replicas.

consideRatio commented 3 years ago

Tech details

Traefik pod

Hardcoded readiness / liveness probes

https://github.com/dask/dask-gateway/blob/7d1659db8b2122d1a861ea820a459fd045fc3f02/resources/helm/dask-gateway/templates/traefik/deployment.yaml#L68-L85

Api pod (gateway)

Configurable readiness / liveness probes

https://github.com/dask/dask-gateway/blob/7d1659db8b2122d1a861ea820a459fd045fc3f02/resources/helm/dask-gateway/templates/gateway/deployment.yaml#L66-L86

Controller pod

No readiness / liveness probes

https://github.com/dask/dask-gateway/blob/7d1659db8b2122d1a861ea820a459fd045fc3f02/resources/helm/dask-gateway/templates/controller/deployment.yaml#L26-L70

droctothorpe commented 3 years ago

> Clarify whether the api pod, traefik pod, and controller pod are able to run with multiple replicas.

All three support HA. We deploy two of each in our cluster. The JH issue you linked is a direct result of hub's lack of HA support.

DG's approach to HA (which is a direct result of its ability to toggle between managing its own database and letting its target backend handle state) is a profound step forward vis-à-vis JH and might be able to serve as a point of reference for how to achieve a similar outcome for JH. Here's the relevant PR: https://github.com/dask/dask-gateway/pull/195.

That being said, your suggested implementation could still be useful for people who don't want HA for some reason (to conserve cluster resources, for example).

consideRatio commented 3 years ago

Thanks for clarifying that all pods have HA support.

Given that traefik is truly HA, a sensitive readinessProbe can make sense, but I reason that the deployment should perhaps default to two replicas, so that the default configuration isn't problematic and doesn't cause at least a 10-second disruption of all network traffic whenever a single /ping request fails.

Concrete suggestion ideas:

- Default the traefik (and api) deployments to two replicas, so that a single flaky /ping request no longer takes down all traffic (see the values sketch below).
- Alternatively, toggle the probe configuration based on the replica count: with a single replica, raise the readinessProbe's failureThreshold or disable the probe entirely.
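A minimal sketch of the first idea, assuming the chart exposed the replica counts in its values.yaml (the key names here are hypothetical; the traefik probes are currently hard-coded in the template):

```yaml
# Hypothetical values.yaml defaults -- key names are illustrative:
traefik:
  replicas: 2   # HA by default: one pod flaking its probe no longer
                # leaves the Service with zero ready endpoints
gateway:
  replicas: 2
```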

droctothorpe commented 3 years ago

Np! Jim did all the actual work on this, so take my answers with a grain of salt.

> Do you know if the controller uses some kind of leader election or similar?

I just logged onto my work computer to double-check this and realized that we're only deploying one controller pod. I'm sorry for the confusion!

Not seeing anything in the controller logic about leader election (which is how Operator SDK and Kubebuilder handle this). Unless I'm missing something (which, clearly, wouldn't be a first, heh), that could be a great candidate for an upstream contribution. I'll file an issue to that effect. It would be something fun for my team to take a swing at if the core maintainers think it's a good idea.
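For context, those frameworks typically coordinate replicas through a coordination.k8s.io Lease object that each controller pod tries to acquire and renew, with only the current holder doing reconciliation. A rough sketch of such a Lease (all names here are hypothetical):

```yaml
# Hypothetical Lease used for leader election among controller replicas;
# only the pod named in holderIdentity acts, the others stand by.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: dask-gateway-controller-leader   # hypothetical name
  namespace: dask-gateway                # hypothetical namespace
spec:
  holderIdentity: controller-6d9f7c-abcde  # the current leader pod
  leaseDurationSeconds: 15                 # others may take over after expiry
```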

At present, it looks like the controller doesn't leverage probes, which alleviates the concern about HA settings being applied to a singleton component.

I see the Traefik probe settings are hard-coded in the template. It looks like those are just inherited from the upstream.

Obviously, defaulting the Traefik and API Gateway replica counts to 2 in the values.yaml is probably the path of least resistance, but I like your proposal for implementing logic to toggle between the two approaches.
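For what it's worth, a sketch of that toggle as it might look in the traefik deployment template (this is not the chart's actual template; the path, port, and thresholds are illustrative):

```yaml
# Hypothetical Helm template logic: keep the sensitive probe only when
# there is more than one replica to fail over to.
{{- if gt (int .Values.traefik.replicas) 1 }}
readinessProbe:
  httpGet:
    path: /ping
    port: 8081             # illustrative port
  periodSeconds: 10
  failureThreshold: 3      # safe: other replicas keep serving traffic
{{- else }}
readinessProbe:
  httpGet:
    path: /ping
    port: 8081
  periodSeconds: 10
  failureThreshold: 1000   # single replica: let the livenessProbe act first
{{- end }}
```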