PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License

Zero downtime rolling updates on EKS #3633

Closed (tpowellocto closed this 4 months ago)

tpowellocto commented 4 months ago


Description of issue

Minor service outage caused by cycling PostgREST Kubernetes pods. When the service is under constant load, taking a functional pod offline causes some requests to be dropped (502 responses), plus a small number of requests to time out.

Expected behavior: connections to an active pod should be allowed to drain before the application is stopped. Actual behavior: as described above, requests in flight when a pod is terminated fail with 502 or time out.


wolfgangwalther commented 4 months ago

A popular method of resolving this seems to be running a sleep command in a preStop lifecycle hook. This is not possible with the current container image, as no shell utilities are packaged within it (it's built from scratch).
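For context, the exec-based workaround is usually wired up roughly like this (a sketch only; the image tag and sleep duration are placeholders, and it works only if the image contains a sleep binary):

```yaml
# Minimal sketch of the exec-based workaround. It assumes the image ships a
# sleep binary, which the official scratch-based image does not.
apiVersion: v1
kind: Pod
metadata:
  name: postgrest
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: postgrest
      image: postgrest/postgrest:xyz   # placeholder tag
      lifecycle:
        preStop:
          exec:
            # Keep serving for a few seconds so load balancers and Endpoints
            # stop routing to this pod before SIGTERM is delivered.
            command: ["sleep", "10"]
```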

You can always create your own docker image and just use something like COPY --from=postgrest/postgrest:xyz /bin/postgrest /bin to get the static executable into your derived image. You can then use all the tools you want.
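A minimal sketch of such a derived image, assuming a busybox base (any small image with a shell would do) and a placeholder version tag:

```dockerfile
# Derived image with shell utilities. Assumptions: busybox as the base and
# "xyz" as a placeholder version tag.
FROM busybox:stable
COPY --from=postgrest/postgrest:xyz /bin/postgrest /bin/postgrest
# PostgREST listens on port 3000 by default.
EXPOSE 3000
ENTRYPOINT ["/bin/postgrest"]
```

With a shell (and a sleep binary) available, the exec-based preStop hook sketched above becomes usable.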

tpowellocto commented 4 months ago

You can always create your own docker image

@wolfgangwalther This has been my solution to date. The described issue definitely makes the provided (official) image less useful though.

wolfgangwalther commented 4 months ago

I have not really looked into the issue itself, but: is there any solution that we can provide via PostgREST natively, without more tools inside the container?

If the only solution is to supply more tools in the container, then we're at "closing, won't fix, there's a workaround", I guess.

wolfgangwalther commented 4 months ago

There seems to be a proposal to make sleeping in preStop a part of k8s itself: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3960-pod-lifecycle-sleep-action
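If that lands, the hook needs no shell in the image; roughly (a sketch, assuming a Kubernetes version where the sleep lifecycle action from that KEP is available):

```yaml
# Sketch of the native sleep action from KEP 3960 (container spec excerpt);
# the kubelet performs the sleep itself, so no binary is needed in the image.
containers:
  - name: postgrest
    image: postgrest/postgrest:xyz   # placeholder tag
    lifecycle:
      preStop:
        sleep:
          seconds: 10
```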

Since we have a workaround now and there is ongoing effort to solve this upstream, I guess we can close this. If you disagree, feel free to re-open with a suggestion on what we could do instead.