StackStorm / stackstorm-k8s

K8s Helm Chart that codifies a StackStorm (aka "IFTTT for Ops" https://stackstorm.com/) High Availability fleet as a simple-to-use, reproducible infrastructure-as-code app
https://helm.stackstorm.com/
Apache License 2.0

Use K8s Liveness & Readiness Probes #3

Open arm4b opened 6 years ago

arm4b commented 6 years ago

Copied from https://github.com/StackStorm/k8s-st2/issues/5

Use Kubernetes Liveness and Readiness probes to check whether a pod's containers are ready and working. For example, st2 services could start but in fact be unreachable or still in an "initializing" state, meaning potential loss of requests.

This becomes important as we reach the Production Deployments stage.

There is an issue to track the implementation progress in StackStorm core: https://github.com/StackStorm/st2/issues/4020 (help wanted!)

Resources

arm4b commented 6 years ago

With liveness probes in mind, we may need to fine-tune Deployment .spec.minReadySeconds (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#min-ready-seconds) to make sure it fits specific st2 service startup profile.

Min Ready Seconds

.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available. This defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when a Pod is considered ready, see Container Probes.
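For illustration, a Deployment fragment with `minReadySeconds` tuned for a slower-starting st2 service might look like the sketch below. The service name, image, and the 30-second value are assumptions for the example, not a measured startup profile:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: st2api          # hypothetical service name for illustration
spec:
  replicas: 2
  # Assumed value: a newly created Pod must stay ready for 30s without any
  # container crashing before the rollout counts it as available.
  minReadySeconds: 30
  selector:
    matchLabels:
      app: st2api
  template:
    metadata:
      labels:
        app: st2api
    spec:
      containers:
        - name: st2api
          image: stackstorm/st2api   # image name assumed
```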

arm4b commented 6 years ago

st2web liveness/readiness probes could first be implemented via an exec.command probe chaining several requests (curl && curl && curl) against the /api, /auth and /stream endpoints.
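A minimal sketch of that exec probe, assuming st2web's nginx proxies /api, /auth and /stream on localhost port 80 (paths and timings are assumptions to tune):

```yaml
readinessProbe:
  exec:
    command:
      - sh
      - -c
      # Probe succeeds only if all three proxied endpoints answer.
      - >
        curl -sf http://localhost/api/ &&
        curl -sf http://localhost/auth/ &&
        curl -sf http://localhost/stream/
  initialDelaySeconds: 10
  periodSeconds: 30
```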

cognifloyd commented 3 years ago

Thinking about readinessProbe, we could probably approximate something with lifecycle.postStart.exec.command that waits for some condition, like a particular log line being printed (if it has access to the logs) or the main process establishing a stable RabbitMQ connection (as sniffed by ss or netstat). This would work because the pod is not marked as ready until after postStart exits successfully.
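A sketch of the postStart variant, assuming RabbitMQ is reached on the default port 5672 and `ss` is available in the container image:

```yaml
lifecycle:
  postStart:
    exec:
      command:
        - sh
        - -c
        # Poll until the main process holds an established connection to
        # RabbitMQ (port 5672 assumed); give up after ~60s so the pod
        # eventually fails instead of hanging forever.
        - |
          for i in $(seq 1 30); do
            ss -tn state established '( dport = :5672 )' | grep -q 5672 && exit 0
            sleep 2
          done
          exit 1
```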

The same could be done in readinessProbe.exec.command as well, as long as it isn't watching for a log line.

These use an eventlet.wsgi loop, so using a tcp probe might be appropriate:
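For the eventlet.wsgi services, a TCP probe on the listen port may be enough to tell that the WSGI loop is accepting connections. A sketch (port 9101 is an example value, adjust per service):

```yaml
livenessProbe:
  # Succeeds if the kubelet can open a TCP connection to the port,
  # i.e. the eventlet.wsgi loop is accepting connections.
  tcpSocket:
    port: 9101        # example port; set to the service's listen port
  initialDelaySeconds: 15
  periodSeconds: 20
```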

These pods use st2common.transport.consumers.MessageHandler.start() and so they're ready shortly after connecting to RabbitMQ:

This pod does not use MessageHandler, but it should still be ready shortly after MQ connection
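For these non-HTTP services, "has an established RabbitMQ connection" could stand in for readiness. A sketch as a readinessProbe, with the same port-5672 and `ss`-availability assumptions as above:

```yaml
readinessProbe:
  exec:
    command:
      - sh
      - -c
      # Ready once an established connection to RabbitMQ (port 5672
      # assumed) exists; exits non-zero otherwise.
      - ss -tn state established '( dport = :5672 )' | grep -q 5672
  initialDelaySeconds: 10
  periodSeconds: 15
```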

For st2sensorcontainer pods defined in st2.packs.sensor, we don't need to worry about the probes. If we did, an MQ connection would still be a good indicator, as that should happen after the sensor processes get spawned.

When we implement this in st2 itself, maybe we'll end up with an MQ-based heartbeat for each service to say "I'm up". I'm not sure how that could interact with k8s, but it's a thought. Or maybe we ship a mini script on each pod that sends a ping message to the MQ; for the service to be considered live, all that has to happen is for it to respond back with a pong. That would allow k8s to check liveness with exec.command without exposing an HTTP endpoint on services that don't already have one.
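If such a helper ever existed, wiring it into a probe would be trivial. A hypothetical sketch (the `st2-service-ping` command below is invented for illustration, it is not a real st2 tool):

```yaml
livenessProbe:
  exec:
    command:
      # Hypothetical helper: publishes a ping over the MQ and exits 0
      # only if the service answers with a pong in time.
      - st2-service-ping
  timeoutSeconds: 5
  periodSeconds: 60
```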