caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

readiness probe #6403

Open elee1766 opened 1 week ago

elee1766 commented 1 week ago

When running on an orchestrator, it's rather valuable to be able to query an HTTP endpoint to know when an app is both 'live' and 'ready'.

People have had this problem in the past, and have used workarounds such as: https://github.com/caddyserver/caddy/issues/3441#issuecomment-633158798

However, I don't think this workaround is perfect: if I'm mounting multiple servers, I don't know whether every caddy.App has finished its Start(), so health could mistakenly be reported as true. I understand that I could health check my individual servers and configure one check per server - but now we are talking about more complicated hacks.

Along with this, I believe it's possible for modules to still be mutating state after Start() has been called, depending on how the module is implemented.

Caddy already has a mechanism to notify an arbitrary system when it is ready, so I don't believe this sort of ask is out of scope for the project: https://github.com/caddyserver/caddy/blob/fab6375a8bebd952abc80e63fa31b648ae1ebc0b/cmd/commandfuncs.go#L239-L256. Using that to engineer something that works well with modern orchestrators isn't very fun, though.

Does it make sense to mount an endpoint on the admin API, somewhere around here https://github.com/caddyserver/caddy/blob/master/caddy.go#L559, like GET /health/readiness, so that one could use it to determine when to start sending traffic to an app?

mohammed90 commented 1 week ago

Although Caddy calls Start on an app, there's no guarantee that the app can accept traffic, or whether the app accepts traffic at all. For HTTP servers, it's possible to configure an endpoint with a strict matcher for requests on GET /health/readiness to return 200 OK. Modules that require more nuanced checks before replying with OK can attach custom routes to the admin endpoint by implementing a module in the admin.api. namespace.
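For the first option, an untested, minimal Caddyfile sketch would be something like this (the listener address is just a placeholder):

```Caddyfile
# sketch: a site block that answers only the readiness path with 200
:8080 {
	@ready {
		method GET
		path /health/readiness
	}
	respond @ready 200
}
```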

https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/admin.go
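The linked reverse proxy admin module follows that pattern. Roughly, and untested, a readiness version would look something like this (the module ID and the readiness check itself are placeholders):

```go
package readiness

import (
	"net/http"

	"github.com/caddyserver/caddy/v2"
)

func init() {
	caddy.RegisterModule(Readiness{})
}

// Readiness is a hypothetical admin.api module that exposes a readiness route
// on the admin endpoint.
type Readiness struct{}

// CaddyModule returns the Caddy module information.
func (Readiness) CaddyModule() caddy.ModuleInfo {
	return caddy.ModuleInfo{
		ID:  "admin.api.readiness",
		New: func() caddy.Module { return new(Readiness) },
	}
}

// Routes attaches the readiness route to the admin endpoint.
func (Readiness) Routes() []caddy.AdminRoute {
	return []caddy.AdminRoute{
		{
			Pattern: "/health/readiness",
			Handler: caddy.AdminHandlerFunc(func(w http.ResponseWriter, r *http.Request) error {
				// whatever nuanced check the module needs would go here
				w.WriteHeader(http.StatusOK)
				return nil
			}),
		},
	}
}

// Interface guard
var _ caddy.AdminRouter = (*Readiness)(nil)
```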

Does neither of these help?

elee1766 commented 1 week ago

> Modules that require more nuanced checks before replying with OK can attach custom routes to the admin endpoint by implementing a module in the admin.api. namespace.

Implementing a custom hook for plugins to notify their readiness in the admin API might work, but from my understanding it means that I need to:

  1. write a module which can receive readiness messages within the process and serves the HTTP handler for readiness, and
  2. modify every single app to send a signal to that module when it's ready, likely by importing the first module and then calling into it.

This seems like a nightmare, and I would have trouble doing it for caddyhttp since that's not my module. Maybe there is a way that I am not seeing here.
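To make point 2 concrete, the coordination module would have to expose something like the following for every app to import and call from its Start(), which is exactly what I can't do for apps I don't own (all names here are hypothetical):

```go
package readysignal

import "sync"

// hypothetical process-wide registry that every app would have to opt into
var (
	mu      sync.Mutex
	pending = make(map[string]bool)
)

// Expect registers an app that must report readiness before we answer 200.
func Expect(appName string) {
	mu.Lock()
	defer mu.Unlock()
	pending[appName] = true
}

// Ready marks an app as ready; every app's Start() would need to call this.
func Ready(appName string) {
	mu.Lock()
	defer mu.Unlock()
	delete(pending, appName)
}

// AllReady is what the admin.api readiness handler would consult.
func AllReady() bool {
	mu.Lock()
	defer mu.Unlock()
	return len(pending) == 0
}
```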

> Although Caddy calls Start on an app, there's no guarantee that the app can accept traffic, or whether the app accepts traffic at all.

This is exactly part of the problem. Some apps start and I can't check them for readiness, hence my wanting to wait for every app's Start() to finish before sending any traffic to Caddy. The only way I can really know Caddy is ready is when all Start() hooks have executed without error, and as far as I can see there is no endpoint that exposes that information.

I basically want to use the same hook as the systemd notification (config loading done), or the post-startup TCP callback (after load), as the condition for marking Caddy as ready for external programs.

The least intrusive hack I have thought of so far is to run a sidecar, or to bundle a second process into my Docker images, that receives the post-startup TCP message from Caddy and then starts serving readiness on a different port that we can health check. But this feels really bad.
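For reference, the sidecar I'm describing would be roughly the following. This is an untested sketch and assumes the --pingback behavior in cmd/commandfuncs.go, i.e. caddy run dials the given address after a successful load and writes back whatever it read on stdin; the config path and ports are placeholders.

```go
// readiness-sidecar: start caddy, wait for its --pingback confirmation,
// then report ready on a separate port that the orchestrator can probe.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"io"
	"log"
	"net"
	"net/http"
	"os"
	"os/exec"
	"strings"
	"sync/atomic"
)

func main() {
	var ready atomic.Bool

	// listener that caddy dials once it has finished loading the config
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}

	// random token handed to caddy on stdin; --pingback should echo it back
	tokenBytes := make([]byte, 32)
	if _, err := rand.Read(tokenBytes); err != nil {
		log.Fatal(err)
	}
	token := hex.EncodeToString(tokenBytes)

	cmd := exec.Command("caddy", "run",
		"--config", "/etc/caddy/caddy.json", // placeholder config path
		"--pingback", ln.Addr().String())
	cmd.Stdin = strings.NewReader(token)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// flip the readiness flag once caddy phones home with the right token
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		defer conn.Close()
		buf := make([]byte, len(token))
		if _, err := io.ReadFull(conn, buf); err == nil && string(buf) == token {
			ready.Store(true)
		}
	}()

	// the endpoint the orchestrator actually probes
	http.HandleFunc("/health/readiness", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```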

In our specific use case, we have a Caddy app that serves Postgres over TCP. We originally used the admin API HTTP endpoints to determine whether we had started up, but we realized that was wrong.

We then added a TCP probe on the Postgres port, and it's mostly good, but there is a window between the start of listening and actually being ready to serve, and it seems things try to connect before Start() has completely finished, resulting in lost/failed connections. We could change our plugin a little to reduce this, but we need information from the listening socket post-bind to finalize our handler, so it's a bit difficult. We could bind a second TCP port at the end of Start() and probe there, but that seems wrong.

Overall, I feel like wanting to wait for all Start() hooks to run before sending traffic to Caddy is a pretty standard need (especially for zero-downtime redeploys), so maybe it shouldn't be as difficult as either writing a set of custom modules or writing your own health check endpoint for each app and configuring all of them.