StefanSchuhart opened 4 years ago
You're right, the liveness and readiness probes are currently missing.
We will add them in the future. Anyhow, pull requests are very welcome :-)
After further investigating this topic here and here, it seems to me that implementing a liveness probe is as simple as adding
livenessProbe:
  httpGet:
    path: /
    port: somePort
  initialDelaySeconds: 60
  periodSeconds: 20
  failureThreshold: 1
to the container spec. But for the readiness probe there should be a distinct service path to check whether a pod is busy.
Could you please confirm my findings?
Thanks for your suggestion. Yes, the livenessProbe for the HTTP pod should look like this. Anyhow I suggest changing the check path to something like /v1.1/Things, since the root path can be delivered without access to the database, while /v1.1/Things checks that the database connection is established correctly and the DB schema changes have run successfully. In my opinion the liveness and readiness probe can check the same path. The probes for the MQTT pod might be a bit more complicated.
If we include the probes, we need to make sure that all the options are completely included in the Helm chart (which is not difficult, but time-consuming).
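A minimal sketch of how that could look in the chart; the values key names and the port are just illustrative, not the chart's current layout:

# values.yaml (illustrative keys)
frost:
  http:
    livenessProbe:
      httpGet:
        path: /v1.1/Things
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 20
      failureThreshold: 3

# deployment template: copy whatever is configured into the container spec
{{- with .Values.frost.http.livenessProbe }}
          livenessProbe:
            {{- toYaml . | nindent 12 }}
{{- end }}

That way the chart only passes the options through, and users can tune or drop the probe from their values file.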
Maybe even use /v1.0/Things?$top=1, as /v1.0/Things might be expensive to load.
In my opinion, livenessProbes should only detect errors within the pod's own process, and not check external services (like databases). What would be the benefit of restarting an otherwise functional pod if the database is currently not answering?
For the readinessProbe, checking the path /v1.0/Things?$top=1 could be sufficient.
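Something like the following, assuming the HTTP container listens on 8080 and with purely illustrative timings (query strings are allowed in an httpGet path):

readinessProbe:
  httpGet:
    path: /v1.0/Things?$top=1
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3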
I agree that restarting the HTTP pod doesn't help if the external DB is not available. Anyhow, there is no other way to check whether the pod is fully functional than by checking its function (which is to deliver data from the DB). Also, there might be cases where restarting a pod can solve DB connection issues.
For me there is another argument for checking the full functionality of the pod: if a pod is considered alive (based on the livenessProbe), it is registered as a backend for the Kubernetes service. A pod should only receive requests if it is able to handle them (which requires a working DB).
I would like to add some considerations after recently running into an issue and reading some more about this topic, and to bump this issue.
One of our pods failed with the error below but was not killed, so it kept receiving requests, which led to timeouts on the client side:
Exception in thread "AsyncFileHandlerWriter-66233253"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "http-nio-8080-Acceptor-0"
Exception in thread "http-nio-8080-Acceptor-0" Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
10:10:30.279 [MQTT Rec: FROST-MQTT-Bus-4f2c09e6-17f2-48d1-bde8-c261d5c22fd8] WARN d.f.i.i.f.m.MqttMessageBus - Connection to message bus lost.
Exception in thread "http-nio-8080-exec-1" java.lang.OutOfMemoryError: Java heap space
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "http-nio-8080-ClientPoller-0"
Exception in thread "http-nio-8080-exec-6" java.lang.OutOfMemoryError: Java heap space
09-Apr-2021 10:11:39.896 SEVERE [http-nio-8080-exec-9] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Failed to complete processing of a request
java.lang.OutOfMemoryError: Java heap space
One can read plenty of recommendations and advice about mistakes to avoid when setting up liveness/readiness probes. One is to never check external dependencies: doing so could lead to a restart of all pods under high load if, for example, the database does not answer in time, rendering your service unavailable.
Following those recommendations, a readiness probe could check that the service is reachable. Maybe checking "/" for a 2xx response would be sufficient, if the FROST-Server makes no connection to the database at that location. For the liveness probe I suggest checking the memory consumption, but I'm no Java programmer, so I don't know whether that would help when the error above is thrown.
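The liveness side of that split could look like the sketch below (port and timings are illustrative). Note that this only helps if the JVM still answers HTTP at all after the OutOfMemoryError; making the JVM exit on OOM (for example with the HotSpot flag -XX:+ExitOnOutOfMemoryError) would be a separate, complementary measure:

livenessProbe:
  httpGet:
    path: /        # no database access involved
    port: 8080
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3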
Although the FROST-Server pods in our cluster have proved to be stable in the long run, we sometimes face issues where an http or mqtt pod becomes unresponsive but the container process is not terminated.
To detect unresponsive http/mqtt pods and kill them, the Helm chart should implement liveness and readiness probes for each.
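For the mqtt pod there is no SensorThings HTTP endpoint to query, so a plain TCP check might be a starting point; the port is an assumption (the default MQTT port 1883) and would have to match the chart's configuration:

livenessProbe:
  tcpSocket:
    port: 1883
  initialDelaySeconds: 30
  periodSeconds: 20
  failureThreshold: 3
readinessProbe:
  tcpSocket:
    port: 1883
  periodSeconds: 10
  failureThreshold: 3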