Kong / charts

Helm chart for Kong

The failure of the readiness probe leads to a crash of the proxy #979

Closed liverpool8056 closed 3 months ago

liverpool8056 commented 8 months ago

As outlined in https://github.com/Kong/charts/blob/main/charts/kong/values.yaml, both the liveness and readiness probes rely on Kong's status_listener. However, pods stop receiving traffic through the Kubernetes Service once the readiness probe fails, which means the liveness probe can no longer be served after the readiness probe fails. Moreover, in the event of a database outage, the readiness probe will fail, potentially leading to the destruction of the Kong pod. Considering this, I am wondering whether the liveness probe should use a different mechanism rather than an HTTP probe, given that a failing readiness probe typically does not result in application termination.
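
For context, the defaults in that values.yaml look roughly like this (an approximation; the exact paths, port name, and thresholds differ between chart versions), with both probes being HTTP checks against the status listener:

    # Approximate chart defaults -- both probes target the status listener.
    # Newer chart versions use /status/ready for readiness; older ones use
    # /status for both. Verify against your chart version.
    livenessProbe:
      httpGet:
        path: "/status"
        port: status
        scheme: HTTP
      initialDelaySeconds: 5
      timeoutSeconds: 5
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: "/status/ready"
        port: status
        scheme: HTTP
      initialDelaySeconds: 5
      timeoutSeconds: 5
      periodSeconds: 10
      failureThreshold: 3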

randmonkey commented 7 months ago

@liverpool8056 It is possible to change the liveness probe and readiness probe. You can see the example (default) configuration of the probes in the chart's values.yaml linked above.

You can override these settings in your own values file when you install Kong with Helm.
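
For example, a minimal override in your own values file might look like this (a sketch with illustrative values, not recommended settings; the keys mirror the chart's top-level livenessProbe defaults). Pass the file to Helm with the -f/--values flag on install or upgrade:

    # custom-values.yaml -- illustrative livenessProbe override
    livenessProbe:
      httpGet:
        path: "/status"
        port: status
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 10
      # Tolerate more consecutive failures before the kubelet restarts the container
      failureThreshold: 6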

rainest commented 7 months ago

Were you seeing an issue in practice? Can you provide replication steps?

Which versions are you using? The endpoints returning failure due to a database outage was a bug that affected older versions of Kong. Current versions should return successful responses (albeit with database.reachable=false) if the database is offline. If you're seeing failures on the latest Kong version, we'd want to investigate that as a bug.

While failing readiness does remove a replica from the list of Service endpoints, this should not affect the liveness check. Kubernetes' probe requests are issued on a per-replica basis directly from the kubelet. They do not pass through the Service, which instead dispatches each request to a single replica chosen from the ready endpoints. Pods that become unready still receive readiness and liveness probes.

liverpool8056 commented 7 months ago

Thanks @randmonkey for your workaround. Though it is an alternative, I think it would be better to provide users with a more reliable one natively.

@rainest Here is the FTI I worked on last month: https://konghq.atlassian.net/browse/FTI-5346, and the reproduction steps are available there.

> While failing readiness does remove a replica from the list of Service endpoints, this should not affect the liveness check.

This is true; the liveness and readiness probes are independent of each other. However, in this case the liveness probe is an HTTP one. After the pod is removed from the list of endpoints due to the readiness failure, it is no longer ready to receive any traffic, and it seems the HTTP liveness probe can't reach the pod either, which leads to the liveness failure. That's what I observed in the FTI.

randmonkey commented 4 months ago

@liverpool8056 As @rainest has mentioned, liveness and readiness probes are issued by the kubelet directly against each pod, NOT resolved through the Service endpoints. Also, liveness and readiness probes support an exec mode to execute a command in the container and a tcpSocket mode to test whether a TCP port is open: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/. You can also change the liveness probe to exec like this:

    # The cat /tmp/healthy command is the generic example from the Kubernetes
    # docs; substitute a check that is meaningful for your Kong container.
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
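
Similarly, if you only want liveness to verify that the status port accepts connections, a tcpSocket probe works too (8100 is assumed to be the chart's default status container port; adjust it if you have changed that in your values):

    livenessProbe:
      tcpSocket:
        # Assumed default status listener port; verify against your chart values
        port: 8100
      initialDelaySeconds: 5
      periodSeconds: 10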