envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Use different thread to export metrics #23976

Closed zhangbo1882 closed 1 year ago

zhangbo1882 commented 1 year ago

Currently, Envoy uses the main thread to export metrics through the admin interface (StatsHandler::handlerPrometheusStats). In the normal case this is fine. However, when there is a huge number of metrics to export, rendering them takes so long that the main thread is blocked from handling critical events such as xDS configuration updates.

We have hit this issue in our production environment: when endpoints change, Envoy cannot process the EDS update in time, which impacts traffic to the new endpoints.

zuercher commented 1 year ago

cc @jmarantz

jmarantz commented 1 year ago

Before we add more threads I think there's some low-hanging fruit optimizing the Prometheus handler in a manner parallel to the other handlers.

See https://github.com/envoyproxy/envoy/issues/16139 -- I can help review this but would prefer for someone else to drive it, as we do not use Prometheus format.

zhangbo1882 commented 1 year ago

Any suggestions? @jmarantz

jmarantz commented 1 year ago

Are you asking for suggestions on which particular people could do this C++ work? Or are you thinking of doing it and want suggestions on how to get started?

zhangbo1882 commented 1 year ago

On our side, we hit this performance issue in our production environment. The first question is how to avoid it, or whether there is a workaround. In a previous comment you mentioned that the Prometheus handler could be optimized. Is anybody working on that?

jmarantz commented 1 year ago

Not yet. I thought there might be interest in working on this, but nothing has materialized yet.

l8huang commented 1 year ago

https://github.com/envoyproxy/envoy/issues/16139 is about a memory issue, which is different from this one.

This issue is about CPU usage. One option is splitting the admin API and xDS config updates onto different threads (keep the admin API on the main thread and move xDS config updates to another thread, or vice versa). What's your thought? @jmarantz

jmarantz commented 1 year ago

The title of #16139 is about memory, but it's just as much about CPU. @rulex123 is going to work on this (yay!). I believe the CPU overhead will be reduced considerably once we are no longer copying the entire set of stats three times (first into maps, then into a giant string, then into a giant Buffer). Once that lands, let's re-measure and see if we want to use another thread.
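
For illustration, here is a minimal standalone C++ sketch (not Envoy's actual StatsHandler code; the stat types and the Prometheus text format here are simplified assumptions) contrasting the copy-everything-into-one-buffer approach with chunked, streaming rendering:

```cpp
// Minimal standalone sketch (not Envoy's real StatsHandler) contrasting the
// "copy everything" rendering with a chunked, streaming rendering.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Counter {
  std::string name;
  uint64_t value;
};

// Copy-heavy approach: the whole response is accumulated in one growing string
// before a single byte is written out.
std::string renderAll(const std::vector<Counter>& counters) {
  std::string out;
  for (const auto& c : counters) {
    out += "# TYPE " + c.name + " counter\n";
    out += c.name + " " + std::to_string(c.value) + "\n";
  }
  return out; // one giant buffer, held for the lifetime of the response
}

// Streaming approach: hand off roughly fixed-size chunks as they are produced,
// bounding peak memory and the length of any uninterrupted CPU burst.
void renderChunked(const std::vector<Counter>& counters, size_t chunk_bytes,
                   const std::function<void(const std::string&)>& sink) {
  std::string chunk;
  for (const auto& c : counters) {
    chunk += "# TYPE " + c.name + " counter\n";
    chunk += c.name + " " + std::to_string(c.value) + "\n";
    if (chunk.size() >= chunk_bytes) {
      sink(chunk); // e.g. write this chunk to the connection and yield
      chunk.clear();
    }
  }
  if (!chunk.empty()) {
    sink(chunk);
  }
}

int main() {
  std::vector<Counter> counters{{"requests_total", 42}, {"errors_total", 3}};
  // Streaming version: each ~64-byte chunk is handed off as soon as it's ready.
  renderChunked(counters, 64, [](const std::string& chunk) {
    std::fwrite(chunk.data(), 1, chunk.size(), stdout);
  });
  // Copy-everything version, shown for comparison; with millions of stats this
  // string (plus the map and Buffer copies in the real handler) gets very large.
  std::string all = renderAll(counters);
  (void)all;
  return 0;
}
```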

Another thought that has been shared in the past is having the admin listener land request handling on a normal worker thread. That would have the (arguable) benefit of allowing unrelated admin handlers to run concurrently. However, it would mean we'd have to make the admin infrastructure thread-safe (to the extent that it isn't already).

In either case, there are probably assumptions in various places that the admin handlers run on the main thread, and we'll have to be careful to suss them all out and make them thread-safe.

l8huang commented 1 year ago

@jmarantz Thank you and @rulex123, it's great to hear that the CPU consumed by admin API calls (especially /stats/prometheus) will be reduced by streaming.

IMHO, splitting admin handlers and xDS updates onto different threads is still desirable -- consider a scenario where multiple scrapers call /stats/prometheus at short intervals and the number of metrics is large; the (state-of-the-world) xDS updates could be slowed down by metrics scraping.

kb000 commented 1 year ago

I suggest investigating what immediate performance gains could be had by using a non-HTTP stats sink. For example, the sink configured via https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/metrics_service.proto appears to stream metrics as Prometheus-format protos over gRPC. You may need some other piece of infrastructure to bridge the gap between your current collector/scraper and this gRPC stat_sink, but at least your admin thread won't be doing all of the copying mentioned by @jmarantz.

jmarantz commented 1 year ago

This is an interesting idea. I would not expect moving to gRPC to make anything more efficient; actually, given it's gRPC (built on HTTP/2, built on SSL), it might be less efficient. But kicking the collection off to another thread will at least free up the main thread for xDS handling, if that's the main goal.

I have not looked at the metrics service code at all and I'm not sure how it scales with huge numbers of stats.

jmarantz commented 1 year ago

One other idea: we could leave admin on the main thread if that's easy, and have another dedicated thread for collecting stats in chunks, posting them back to the main thread for sending to the network.
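
As a rough illustration of that hand-off pattern, here is a standalone C++ sketch. This is not Envoy's actual API -- in Envoy the post-back would presumably go through something like Event::Dispatcher::post() -- and the queue below just stands in for the main-thread event loop:

```cpp
// Standalone sketch of the "dedicated stats thread posts chunks back to the
// main thread" idea. A mutex-protected queue stands in for the main-thread
// event loop; this is an assumption-laden illustration, not Envoy code.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class MainThreadQueue {
public:
  void post(std::function<void()> cb) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(cb));
    }
    cv_.notify_one();
  }

  // Runs posted callbacks until an empty callback (the shutdown sentinel) arrives.
  void run() {
    while (true) {
      std::function<void()> cb;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return !queue_.empty(); });
        cb = std::move(queue_.front());
        queue_.pop();
      }
      if (!cb) {
        return;
      }
      cb();
    }
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
};

int main() {
  MainThreadQueue main_thread;

  // Dedicated collector thread: renders stats in chunks off the main thread and
  // posts each finished chunk back for the main thread to write to the network.
  std::thread collector([&main_thread] {
    for (int i = 0; i < 3; ++i) {
      std::string chunk = "stat_chunk_" + std::to_string(i) + "\n"; // stand-in for real rendering
      main_thread.post([chunk] {
        // Runs on the "main thread": write the chunk to the admin connection.
        std::fwrite(chunk.data(), 1, chunk.size(), stdout);
      });
    }
    main_thread.post(nullptr); // sentinel: stop the demo loop
  });

  // In a real event loop the main thread could interleave other work (e.g. xDS
  // handling) between chunks, since each posted callback is short.
  main_thread.run();
  collector.join();
  return 0;
}
```

The point of the split is that the expensive rendering happens on the collector thread, while the main thread only spends time writing each small chunk out.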

Note that the Prometheus stats handler does not yet provide chunked responses, but @rulex123 is working on that. The other stats handlers provide 2M chunks (this is a compile-time constant).

If the collection of stats is done entirely on worker threads and is not chunked finely enough, it might cause latency hiccups for end users' requests. Note that in Envoy the worker-thread silo model constrains an incoming request to be fully handled on a single thread, but that thread is multiplexed across other requests. So if one of those requests uses the CPU for too long (say, collecting all the stats into a buffer without yielding), then active end-user requests can be stalled. IDK if we want to go in this direction.

OTOH, if a new dedicated thread is allocated for stats collection and it's likely to be mostly busy, we might want to allocate one fewer worker thread so we don't starve the worker threads for cores.

The above two issues are why I would prioritize #16139 over this, if I have to pick one. That effort strictly makes things more efficient than they are today. This issue is just about moving the load around to another thread, and compute-bound threads are not an infinite commodity.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

zhangbo1882 commented 6 months ago

Any update on this issue?