benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org

Kubernetes -- dedicated worker for healthcheck ? #2467

Closed matthew-walters closed 3 years ago

matthew-walters commented 3 years ago

Hi, is it possible to configure Gunicorn to have a dedicated worker that handles a specific REST endpoint (/healthcheck) which Kubernetes liveness and readiness probes will call?

I'm seeing an issue where, if a bunch of real requests are waiting while the service calls a slow external service or database, then the healthcheck request gets queued, ends up taking too long, and Kubernetes marks it as failed.

If I simply increase the worker count, I might be back in the same situation tomorrow.

Ideally, I could have a dedicated worker that only handles the /healthcheck endpoint. I tried playing around with binding multiple addresses, but I don't know if it's possible to map a particular bound address to particular worker(s).

RonRothman commented 3 years ago

Are you using an async worker class?

If not, then that's the direction I'd pursue.

If so, then something else is going on. Perhaps you're unknowingly making a blocking call somewhere--maybe not directly but through some library?

jab commented 3 years ago

With an async worker class it’s still possible to get in this situation.

But doesn’t allocating a dedicated worker just for liveness and readiness checks reduce the value of those checks? Their value lies in predicting whether your app can process a real client’s request successfully in a timely fashion. If your workers are actually too busy to do that (with however Gunicorn normally routes requests to workers), then allocating a dedicated worker just for liveness and readiness checks will just mask the problem. (And in between checks, you’d have your dedicated liveness check worker taking up memory but sitting idle rather than actually contributing to serving any real requests that come in.)

To actually mitigate rather than mask the problem, is there any room for improvement in how Gunicorn currently routes requests to workers in general?

A complementary idea could be for Gunicorn to accept an option that would allow it to automatically scale up the number of workers temporarily when it detects that they’re under heavy enough load, making it actually elastic and a bit more resilient to bursts that it couldn’t otherwise handle.

RonRothman commented 3 years ago

With an async worker class it’s still possible to get in this situation.

With a sync worker, you will /surely/ get in this situation (at scale).

With a properly designed async worker, the only time your health check will block is when it actually should, i.e. when all worker threads/greenlets/whatevers are occupied. This is the behavior you should want. (Does your application have some unique health check requirement that necessitates more than this?)

But doesn’t allocating a dedicated worker just for liveness and readiness checks

Maybe I missed something, but I don't know why you're bringing up a "dedicated worker." I suggested no such thing. What I suggested was that you design your service such that it doesn't block. In the absence of more information, your blocking workers are the likeliest reason for your failing health check. Which is why I asked:

Are you currently using sync workers?

matthew-walters commented 3 years ago

Thinking about it some more, in this situation, the k8s liveness probe should pass and the readiness probe should fail.

I'd want the liveness probe to fail only if the running app [in the container in the pod] is unresponsive in a way that can only be fixed by a restart. This is not such a case: just let the current request finish and the app is responsive again, assuming appropriate timeouts are set for when this app is a client to external services. But the readiness probe should fail here, so that k8s temporarily removes the pod from the load balancer until it can handle traffic again. That would mean the readiness endpoint uses the original worker pool, but the liveness endpoint still needs its own worker so it can respond immediately, since it only signals that the app is alive, not that the service is necessarily ready to accept regular traffic.
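
To make that split concrete, here is a minimal sketch (hypothetical paths and dependency check, not the app from this issue) of a WSGI app whose liveness endpoint stays trivial while its readiness endpoint exercises dependencies. Note that both are still served by the same worker pool, which is the crux of this issue:

# Hypothetical sketch: /live only shows the process can serve a request at all,
# while /ready also checks a dependency before reporting 200.
def dependencies_ok():
    # Placeholder: check the database / external service with a short timeout.
    return True

def app(environ, start_response):
    path = environ["PATH_INFO"]
    if path == "/live":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"alive"]
    if path == "/ready":
        if dependencies_ok():
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"ready"]
        start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
        return [b"not ready"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]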

I am using sync workers. As far as I'm told I cannot use async because of some libraries in use, but I will see for myself. I can always just increase the number of workers if anything.

jab commented 3 years ago

@ronrothman writes:

Maybe I missed something, but I don't know why you're bringing up a "dedicated worker." I suggested no such thing.

The title of this issue is "dedicated worker for healthcheck". That's what I was responding to.

With a properly designed async worker, the only time your health check will block is when it actually should, i.e. when all worker threads/greenlets/whatevers are occupied.

I agree with this, and I also encourage the use of async workers.

But:

This is the behavior you should want.

I disagree with the idea that there is no way we could improve Gunicorn to behave better in this situation. For example, adding an option to automatically scale out workers temporarily when all the current workers are busy, as I suggested.

Here's a program you can use to experiment:

# test.py:

#!/usr/bin/env python3

from contextvars import ContextVar
from time import sleep
from uuid import uuid4

_worker_id = ContextVar("worker_id")
_request_id = ContextVar("request_id")

def test_app(environ, start_response):
    start_response("200", [])
    worker_id = _worker_id.get()
    request_id = _request_id.get()
    if environ["PATH_INFO"] == "/heartbeat":
        print(f"{worker_id=}: /heartbeat -> responding right away [{request_id=}]")
        return [b"still alive"]
    print(f"{worker_id=}: non-heartbeat -> sleeping [{request_id=}]")
    sleep(9999)  # simulate having to perform a long-running (including CPU-bound) operation
    return [b"yawn"]

def _pre_request(worker, request):
    worker_id = worker.pid
    request_id = uuid4().hex
    _worker_id.set(worker_id)
    _request_id.set(request_id)
    print(f"Gunicorn dispatched {request_id=} to {worker_id=}")

def _post_request(worker, request, environ, response):
    worker_id = _worker_id.get()
    request_id = _request_id.get()
    print(f"{worker_id=} finished processing {request_id=}")

def main():
    from sys import argv
    from gunicorn.app.wsgiapp import WSGIApplication
    argv[:] = argv + [f"{__name__}:{test_app.__name__}"]
    app = WSGIApplication()
    app.cfg.settings['pre_request'].value = _pre_request
    app.cfg.settings['post_request'].value = _post_request
    app.run()

if __name__ == "__main__":
    main()

In one terminal:

$ ./test.py -k gevent  # Note the use of an async worker
[2020-12-02 09:06:46 -0500] [60910] [INFO] Starting gunicorn 20.0.4
[2020-12-02 09:06:46 -0500] [60910] [INFO] Listening at: http://127.0.0.1:8000 (60910)
[2020-12-02 09:06:46 -0500] [60910] [INFO] Using worker: gevent
[2020-12-02 09:06:46 -0500] [60911] [INFO] Booting worker with pid: 60911

In another terminal:

$ curl http://127.0.0.1:8000/ &

In the first terminal, you should see something like:

Gunicorn dispatched request_id='8dc558cda93e4f6f981f5f080d953069' to worker_id=60911
worker_id=60911: non-heartbeat -> sleeping [request_id='8dc558cda93e4f6f981f5f080d953069']

In the second terminal:

$ curl http://127.0.0.1:8000/heartbeat  # this hangs because all workers are currently busy

And you should see no output in the first terminal (the pre-request hook hasn't run yet for the /heartbeat request; Gunicorn is still waiting for a worker to become free).

In a third terminal:

$ kill -TTIN 60910  # (substitute your master pid)

Now observe in the first terminal:

[2020-12-02 09:07:01 -0500] [60910] [INFO] Handling signal: ttin
[2020-12-02 09:07:01 -0500] [60962] [INFO] Booting worker with pid: 60962
Gunicorn dispatched request_id='a4f25ee751c14cb8a70230dce05261c1' to worker_id=60962
worker_id=60962: /heartbeat -> responding right away [request_id='a4f25ee751c14cb8a70230dce05261c1']
worker_id=60962 finished processing request_id='a4f25ee751c14cb8a70230dce05261c1'

And now the curl http://127.0.0.1:8000/heartbeat that was hanging in the second terminal will successfully return with "still alive".

If Gunicorn had an option to intelligently auto-scale, so that it would no longer be necessary to manually send it TTIN and TTOU signals, it could help make the systems that are built with Gunicorn a lot more resilient to transient bursts in load.


@matthew-walters writes:

Thinking about it some more, in this situation, the k8s liveness probe should pass and the readiness probe should fail.

I agree with this. The liveness probe should pass, because if it fails, k8s will restart the pod, which does not actually help with the problem that there is more work to do than the workers can handle. (This assumes busy workers are actually doing useful work while they're busy, which they should be; if they're not, restarting is just a band-aid.) OTOH, failing the readiness probe tells k8s to leave the pod running (so it can continue doing useful work), but not to route any more traffic to it until it's ready for more.

benoitc commented 3 years ago

Isn't the purpose of a health check to know whether your server is up and in good health? Then any HTTP handler answering a request is OK. You don't need something dedicated to it; that would misrepresent the real health of your server. If your health check is blocking, then that is what needs to be fixed.

RonRothman commented 3 years ago

Isn't the purpose of a health check to know whether your server is up and in good health? Then any HTTP handler answering a request is OK. You don't need something dedicated to it; that would misrepresent the real health of your server. If your health check is blocking, then that is what needs to be fixed.

Exactly.

If 100% of your workers are busy, then your server is effectively down, because it cannot handle the next request. The solution is not to take it out of service at the load balancer; the solution is to avoid having your workers become saturated in the first place. (Either by lowering latency, adding more workers, or writing nonblocking workers; or some combination.)

benoitc commented 3 years ago

I disagree with the idea that there is no way we could improve Gunicorn to behave better in this situation. For example, adding an option to automatically scale out workers temporarily when all the current workers are busy, as I suggested.

This is the purpose of an external program. The design of Gunicorn is deliberately simple here: no scaling logic should be handled at that level, though you can extend Gunicorn using the hooks. To do it you have the following possibilities, and probably others I forget:

1) When bad health is detected (using metrics or a check tool), you can take one or a mix of external actions, for example sending a signal to the current master.

One interesting way to detect bad health, instead of using a poller, is to monitor some useful metrics: for example the delay to handle a request, the number of timeouts, errors... A simple way is to monitor the health and, on some events, take the actions above.

2) With hooks you can send internal metrics to a remote server and get back some useful information that could trigger an action, such as killing the current worker or sending a signal to the current master.
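
As an illustration of that external-program approach, here is a minimal, hypothetical supervisor sketch; nothing below is a Gunicorn API. It assumes the master pid is written to a pidfile via Gunicorn's --pid option and that is_overloaded() is your own metric check, and it uses the existing TTIN signal to add a worker when load is detected:

#!/usr/bin/env python3
# Hypothetical external supervisor sketch; not a Gunicorn API.
# Assumes Gunicorn was started with --pid /tmp/gunicorn.pid and that
# is_overloaded() is replaced with your own check (latency, timeouts, errors...).
import os
import signal
import time

PIDFILE = "/tmp/gunicorn.pid"  # assumed path, passed to gunicorn via --pid

def master_pid():
    with open(PIDFILE) as f:
        return int(f.read().strip())

def is_overloaded():
    # Placeholder for a real metric check.
    return False

def main():
    while True:
        if is_overloaded():
            # TTIN asks the master to start one more worker;
            # TTOU would ask it to retire one when load drops again.
            os.kill(master_pid(), signal.SIGTTIN)
        time.sleep(10)

if __name__ == "__main__":
    main()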

matthew-walters commented 3 years ago

I thought of some alternative approach. For the liveness probe, rather than using the httpGet type of probe where the k8s kubelet makes a rest call to specified endpoint (in this case /healthcheck), instead, use the exec type of probe where it executes a command in the container https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command. I'd specify

livenessProbe:
      exec:
        command:
        - /bin/bash
        - -c 
        - "timeout 1 bash -c '</dev/tcp/localhost/5000' 2>/dev/null"

This command returns status 0 if it can connect and a nonzero status if it cannot. I confirmed that even if the worker is busy handling a request, this won't get stuck waiting.
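
For comparison, an equivalent connect-only check in plain Python (standard library only, same assumed port 5000 as above) could be:

#!/usr/bin/env python3
# Equivalent TCP-connect check: exit 0 if the port accepts a connection
# within one second, exit 1 otherwise.
import socket
import sys

try:
    socket.create_connection(("localhost", 5000), timeout=1).close()
except OSError:
    sys.exit(1)
sys.exit(0)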

jab commented 3 years ago

Thanks for the tip about bash's network redirection, @matthew-walters, cool trick! I'd been using nc -z to do the same thing. Good to know there's a cheaper option (at least on Linux)!

Curious if anyone reading this thread has seen any documentation of best practices for liveness and readiness checks for web services. For example, is a liveness check that merely tests that a TCP connection succeeds (as above) recommended, or is it better to test that a successful HTTP(S) request can be made to an actual application endpoint? What failure modes -- where restarting can actually help (given that failing a liveness check causes K8s to restart you) -- are detectable by the latter, but not by the former (e.g. cert expiration in the case of HTTPS, exercising custom health check logic at the application layer that checks application dependencies like a database, another service, etc.) and are they worth the additional cost? Etc. etc.


Quoting @benoitc:

This is the purpose of an external program. The design of gunicorn is deliberately simple on that as no scaling logic should be handled at tha level though you can extend gunicorn using the hooks.

Makes sense. Are you aware of anything we can use for this? Specifically, some kind of dedicated Gunicorn supervisor tool that handles things like auto-scaling, cert rotation, etc.? I searched but didn't immediately find anything, which is kind of surprising given what you wrote, plus how popular Gunicorn is and how long it's been around.

tilgovi commented 3 years ago

For readiness probes, I think it's best practice to use a full HTTP request, because readiness probe failures do not cause container restarts. The application can actually return non-2xx responses for readiness probes to indicate a temporary maintenance mode, such as if you need to stop traffic to a pod but want to be able to debug its state.

For liveness probes, whatever is likely to be remedied by a container restart :). There's a bunch of discussion in #1417 about bypassing the connection queue, but I think it's ultimately not good to get into a situation where workers take longer than the health check timeout to handle a request.

Unfortunately, auto-scaling is really tricky and sometimes failures cascade or systems do the wrong thing! For example, if a service starts failing, upstream services might downscale because their load decreases. When the downstream service becomes healthy, there may be a stampede! Similarly, restarting a container because its connection backlog is too full will not help process requests any faster!

Common to some of these scenarios is that backpressure needs to be high signal and low noise. It may be better to have a short connection backlog and let requests from a load balancer fail, trigger the load balancer to shift traffic to another instance earlier, and to have slightly higher reserve capacity for surges rather than rely on the backlog.
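
For reference, the listen backlog is a Gunicorn setting, so shortening it is a one-line config change (the number below is only illustrative):

# gunicorn.conf.py -- illustrative only
backlog = 64   # Gunicorn's default is 2048; a short backlog surfaces backpressure sooner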

It may sometimes be the right thing to tune the load balancer or a reverse proxy so that it distributes load more evenly, or to scale based on latency rather than CPU, or whatever. There's only so much Gunicorn can try to do!

jab commented 3 years ago

Thanks for the reply, @tilgovi! Just wanted to say that I kept looking after I last posted here, and eventually found https://aws.amazon.com/builders-library/implementing-health-checks/. I'm sure it's possible to write a book on this nuanced topic, but that's the most comprehensive resource I've found yet. Hope this might help anyone else who finds their way here.


This is the purpose of an external program. The design of gunicorn is deliberately simple on that as no scaling logic should be handled at tha level though you can extend gunicorn using the hooks.

Makes sense. Are you aware of anything we can use for this? Specifically, some kind of dedicated Gunicorn supervisor tool that handles things like auto-scaling, cert rotation, etc.? I searched but didn't immediately find anything, which is kind of surprising given what you wrote, plus how popular Gunicorn is and how long it's been around.

OTOH, I haven't yet found anything addressing this gap since I posted the above. If anyone else knows of anything, I'd be super interested to hear about it.