dunglas / frankenphp

🧟 The modern PHP app server
https://frankenphp.dev

interface to collect necessary stats for building a k8s readiness probe. #1122

Open travisghansen opened 4 weeks ago

travisghansen commented 4 weeks ago

Describe your feature request

Is your feature request related to a problem? Please describe. I am very new to frankenphp, so I could certainly have missed something in the docs that already allows me to do this.

I am trying to deploy frankenphp to k8s. I would like to sanely configure readiness probes to keep instances from getting overloaded with requests. I think this would predominantly come down to having 2 key metrics:

Additional metrics may be good:

Describe the solution you'd like I would like to have some sort of interface (HTTP, a CLI that can be invoked, etc.) which would export the above and perhaps more. I would then write a script that retrieves that info and compares the number of queued requests waiting for a thread against a threshold; if above the threshold, k8s would stop sending traffic to that instance until things settle a bit.

Ideally the metrics would exclude any requests that are for non-php workers (i.e. static files, etc.).

Perhaps the caddy admin port already has some of this data?

Describe alternatives you've considered I can have an endpoint in my app, but that seems less than ideal as each check would itself take up a thread/worker.

dunglas commented 4 weeks ago

Hi, indeed, we expose a Prometheus endpoint for this usage, and metrics about workers will be available in the next release: #966.
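A quick way to see what that endpoint exposes is to dump the FrankenPHP metric lines. This is a minimal sketch, assuming Caddy's default admin address (localhost:2019) with metrics enabled; adjust the URL to your configuration.

```go
// Dump the frankenphp_* lines from the Prometheus endpoint.
// The URL is an assumption (Caddy's default admin endpoint with metrics enabled).
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:2019/metrics") // assumed endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print only the FrankenPHP-specific metric lines.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if line := scanner.Text(); strings.HasPrefix(line, "frankenphp_") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```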

travisghansen commented 4 weeks ago

Great! Thanks for the pointers! Those additions are timely for me :)

Is it possible to get:

withinboredom commented 4 weeks ago

@travisghansen see: https://github.com/dunglas/frankenphp/blob/main/docs/metrics.md

travisghansen commented 4 weeks ago

Thanks, I read that in the PR but I don’t see the above-mentioned stats in there... am I missing something? I am essentially looking for some direct insight into the backlog (specifically the backlog of requests that will be handled by php workers, which may be impossible to know pre-emptively; I am unsure of the internals at play).

withinboredom commented 3 weeks ago

It's possible to know, but not possible for metrics to know in any sort of stable way. For example, if your Prometheus scrape happens to land in the window between a request being sent to the workers and it being picked up by a worker, the value will be greater than zero. Really, though, that is just "bad timing."

If this is fine, then we can add it. But it may flap around quite a bit and not be useful, although I guess the "average trend" could be useful.

travisghansen commented 3 weeks ago

Yeah, understood. There are a lot of factors at play there that could swing the number dramatically. The avg time I mentioned could help mitigate the problems you have pointed out with the number, especially if values were exported in different time buckets/spans. There is definitely a lot of nuance to which numbers to show, so further discussion is definitely warranted IMO. For example, maybe a better number is how many requests could not be immediately dispatched in the last 1 second, 3 seconds, or 5 seconds.

Having said that, the ‘current number’ is still very useful, and k8s provides the basic knobs to really help with that: the ability to set the frequency at which the check is run and the rise/fall (success/failure) thresholds. Using those three pieces it is possible to create a liveness probe which very closely mimics an average over a desired period of time.
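As a rough illustration of those knobs, here is a sketch using the Kubernetes Go API types; the exec command is a hypothetical helper and the numbers are placeholders, not recommendations.

```go
// Sketch of probe tuning with k8s.io/api/core/v1 types. With PeriodSeconds=5
// and FailureThreshold=3, an instance is only marked not-ready after the
// check fails for roughly 15 consecutive seconds, which approximates
// smoothing the instantaneous value over that window.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

var backlogProbe = corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		Exec: &corev1.ExecAction{
			// /usr/local/bin/check-backlog is a hypothetical script/binary;
			// see the sketch at the end of this thread for what it could do.
			Command: []string{"/usr/local/bin/check-backlog"},
		},
	},
	PeriodSeconds:    5, // how often the check runs
	FailureThreshold: 3, // "fall": consecutive failures before marking not-ready
	SuccessThreshold: 2, // "rise": consecutive successes before marking ready again
}

func main() {
	fmt.Printf("%+v\n", backlogProbe)
}
```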

I also think the numbers could prove incredibly valuable for really locking in HPA logic as well. Closely related to liveness, but very distinct concepts.

Thanks for the consideration!

withinboredom commented 3 weeks ago

Looking at how things work, this would put metrics directly into the "hot path" of request handling, which I would like to avoid. However, we can passively detect whether workers are stalling (requests are coming in faster than workers can respond) and how bad it is.

So what about a metric like:

frankenphp_[worker]_stalled_[1,3,5]: 0-1

Where the number shows a percentage over the last 1, 3, or 5 seconds. This number represents how full the worker request buffer is. Under low utilization it is always 0. Once it goes above zero, latency tends to grow exponentially in my experiments.

What do you think about that?
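For what it's worth, here is an illustrative sketch (explicitly not FrankenPHP's implementation) of how such a windowed ratio could be derived: periodically sample how full a bounded request buffer is and average the samples over the last few seconds. The type name and tick interval are invented for the example.

```go
// Illustrative only: a windowed "stalled" ratio built from periodic samples
// of buffer fullness (queued / capacity), averaged over the last N seconds.
package main

import (
	"fmt"
	"time"
)

// stallTracker keeps one fullness sample (0-1) per tick in a ring buffer and
// reports the mean over the most recent window.
type stallTracker struct {
	samples []float64
	next    int
	filled  bool
}

func newStallTracker(window, tick time.Duration) *stallTracker {
	return &stallTracker{samples: make([]float64, int(window/tick))}
}

// observe records the current buffer fullness, clamped to the 0-1 range.
func (t *stallTracker) observe(queued, capacity int) {
	f := float64(queued) / float64(capacity)
	if f > 1 {
		f = 1
	}
	t.samples[t.next] = f
	t.next = (t.next + 1) % len(t.samples)
	if t.next == 0 {
		t.filled = true
	}
}

// stalled returns the average fullness over the window, between 0 and 1.
func (t *stallTracker) stalled() float64 {
	n := t.next
	if t.filled {
		n = len(t.samples)
	}
	if n == 0 {
		return 0
	}
	var sum float64
	for _, s := range t.samples[:n] {
		sum += s
	}
	return sum / float64(n)
}

func main() {
	tr := newStallTracker(5*time.Second, time.Second)
	for _, queued := range []int{0, 2, 8, 10, 10} { // hypothetical samples, capacity 10
		tr.observe(queued, 10)
	}
	fmt.Printf("stalled_5s: %.2f\n", tr.stalled())
}
```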

travisghansen commented 3 weeks ago

I think that number is fantastic! Great idea.

Which number goes into the hot path? Is it trying to gather the amount of time each of the stalled requests has been waiting?

If I understand the number correctly, I think there is one piece of missing context which would be incredibly useful (either as a separate metric or encapsulated in a different metric/name altogether): the magnitude of the problem. I suppose this would only be relevant when the proposed stalled metric is 1... if I am 100% stalled, how big is the backlog?

I am completely new to frankenphp, so I am still a little foggy about how the dispatching/queue works; I may be off in my thinking, so please correct my understanding as necessary. For example, I don’t know if requests are simply round-robined to each worker as they come in, or if there is more intelligent scheduling in front of the workers (i.e. do I technically have one backlog, or N backlogs, one per worker?) doing least-connection-style logic, etc.

travisghansen commented 3 weeks ago

#269 (linking to related issue)

withinboredom commented 3 weeks ago

@travisghansen it looks like this is already doable due to a 'logic bug'.

frankenphp_[worker]_busy_workers will go above the number of workers if there are more requests than workers, because it is incremented before we ever send the request to a worker. So, you can use frankenphp_[worker]_busy_workers - frankenphp_[worker]_total_workers to get the backlog, where <= 0 means no backlog.
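Building on that, here is a minimal sketch of an exec-style readiness check using the formula above. The metrics URL (Caddy's default admin endpoint), the suffix-based metric matching, and the threshold are assumptions to adapt; [worker] in the metric names stands for the configured worker.

```go
// Scrape the metrics endpoint, compute busy_workers - total_workers summed
// over all workers, and exit non-zero when the backlog exceeds a threshold so
// Kubernetes stops routing traffic to this instance. Naive line parsing; not
// a full Prometheus exposition parser.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
)

func main() {
	const (
		metricsURL = "http://localhost:2019/metrics" // assumed Caddy admin endpoint
		threshold  = 5.0                             // max tolerated backlog, arbitrary
	)

	resp, err := http.Get(metricsURL)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var busy, total float64
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "frankenphp_") {
			continue // skips comments (# HELP / # TYPE) and other metrics
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		value, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		name := fields[0]
		if i := strings.IndexByte(name, '{'); i >= 0 {
			name = name[:i] // drop labels, if any
		}
		switch {
		case strings.HasSuffix(name, "_busy_workers"):
			busy += value
		case strings.HasSuffix(name, "_total_workers"):
			total += value
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	backlog := busy - total
	fmt.Printf("busy=%v total=%v backlog=%v\n", busy, total, backlog)
	if backlog > threshold {
		os.Exit(1) // not ready: too many queued requests
	}
}
```

Baked into the image and wired up as an exec readiness probe (see the knob sketch earlier in the thread), a non-zero exit tells Kubernetes to stop sending traffic to that pod until the backlog drains.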