envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Prometheus stats handler used too much memory. #16139

Open cocotyty opened 3 years ago

cocotyty commented 3 years ago

If you are reporting any crash or any potential security issue, do not open an issue in this repo. Please report the issue via emailing envoy-security@googlegroups.com where the issue will be triaged appropriately.

Title: Prometheus stats handler used too much memory.

Description: [two attached screenshots]

What issue is being seen? Describe what should be happening instead of the bug, for example: Envoy should not crash, the expected value isn't returned, etc.

Repro steps:

Include sample requests, environment, etc. All data and inputs required to reproduce the bug.

Note: The Envoy_collect tool gathers a tarball with debug logs, config and the following admin endpoints: /stats, /clusters and /server_info. Please note if there are privacy concerns, sanitize the data prior to sharing the tarball/pasting.

Admin and Stats Output:

Include the admin output for the following endpoints: /stats, /clusters, /routes, /server_info. For more information, refer to the admin endpoint documentation.

Note: If there are privacy concerns, sanitize the data prior to sharing.

Config:

Include the config used to configure Envoy.

Logs:

Include the access logs and the Envoy logs.

Note: If there are privacy concerns, sanitize the data prior to sharing.

Call Stack:

If the Envoy binary is crashing, a call stack is required. Please refer to the Bazel Stack trace documentation.

cocotyty commented 3 years ago

https://github.com/envoyproxy/envoy/blob/8a2143613d43d17d1eb35a24b4a4a4c432215606/source/server/admin/prometheus_stats.cc#L211

Maybe using HTTP chunks for the response could fix this problem.

cocotyty commented 3 years ago

@alyssawilk @lizan

alyssawilk commented 3 years ago

cc @jmarantz

jmarantz commented 3 years ago

Agreed -- I think we should stream the stats output from admin rather than buffering all of it. This would be a good project for someone who is comfortable with HTTP and C++ but doesn't need excessively deep knowledge of Envoy.

I have looked at this deeply in the past and could advise, but don't have time to tackle myself.

daixiang0 commented 3 years ago

I want to work on this; please share more details. Thanks @jmarantz

jmarantz commented 3 years ago

The main challenge here is that it's probably simplest to make all the admin handlers stream.

The admin system defines a handler API in ./include/envoy/server/admin.h

#define MAKE_ADMIN_HANDLER(X)                                                                      \
  [this](absl::string_view path_and_query, Http::ResponseHeaderMap& response_headers,              \
         Buffer::Instance& data, Server::AdminStream& admin_stream) -> Http::Code {                \
    return X(path_and_query, response_headers, data, admin_stream);                                \
  }

I think we should define a new class Chunker to manage the chunking process, replace Buffer::Instance& in that handler prototype with a Chunker&, and allow handlers to report an error code to the chunker rather than returning it.

class Chunker {
public:
  void add(absl::string_view);
  void reportError(Http::Code code, absl::string_view error_text); // Cannot call this after add()
};

Most of the handlers call only Buffer::add so we just need to change the data type. I haven't done a thorough survey of all the handlers so this needs to be verified. You can implement Chunker initially to just buffer up all the data and it can be sent all at once, just to bound the size of the first PR.

Then, you can do a second PR to get streaming working in source/server/admin/admin_filter.cc and source/server/admin/admin.cc. The simplest thing is to just stream out a chunk every time add() is called, but some handlers call add() on very small chunks and that wouldn't be efficient. Instead you could add some intelligence to Chunker so it buffers up to some sensible fixed size before emitting that to the network.
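
As a purely illustrative sketch of that second step, a buffering Chunker could look something like the following; the ChunkSink callback and the 16 KiB threshold are assumptions made for this example, not existing Envoy APIs:

```cpp
#include <functional>
#include <string>

#include "absl/strings/string_view.h"

// Hypothetical sketch: buffers added fragments and hands a chunk to a
// caller-supplied sink once a fixed threshold is reached.
class Chunker {
public:
  using ChunkSink = std::function<void(absl::string_view chunk)>;

  Chunker(ChunkSink sink, size_t chunk_size = 16 * 1024)
      : sink_(std::move(sink)), chunk_size_(chunk_size) {}

  // Accumulate data; emit a chunk to the sink whenever the buffer fills.
  void add(absl::string_view fragment) {
    buffer_.append(fragment.data(), fragment.size());
    if (buffer_.size() >= chunk_size_) {
      flush();
    }
  }

  // Emit whatever is buffered, e.g. at the end of the response.
  void flush() {
    if (!buffer_.empty()) {
      sink_(buffer_);
      buffer_.clear();
    }
  }

private:
  ChunkSink sink_;
  const size_t chunk_size_;
  std::string buffer_;
};
```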

There may already be a class in Envoy that does chunking like this. It's worth asking on envoy-dev in Slack.

ggreenway commented 3 years ago

Also note that some of the constraints of the prometheus format require collecting all the metrics to sort them. But I think peak memory use could be reduced by quite a bit by streaming out the textual format of the stats; that doesn't need to be collected as it is now. See comments: https://github.com/envoyproxy/envoy/blob/ca4ca983195f337a840618b45b9b33cda82c1224/source/server/admin/prometheus_stats.cc#L73-L88

jmarantz commented 3 years ago

Agreed. And keeping all the stats of a given type in a vector of pointers is not free, but it's a fraction of the cost of all the serialized strings and values held in memory to be emitted all at once.

cocotyty commented 3 years ago

https://prometheus.io/docs/instrumenting/writing_clientlibs/#exposition [screenshot of the exposition format documentation]

I think we don't need to sort them.

ggreenway commented 3 years ago

Agreed; I'm fine with not sorting if it reduces resources used.

jmarantz commented 3 years ago

We can also have an option ?sort for those that want it and are willing to wait for it and consume more memory.
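
For illustration only, a minimal sketch of gating the sort behind such a query parameter; the crude substring check stands in for real query parsing and is an assumption of this example:

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "absl/strings/match.h"
#include "absl/strings/string_view.h"

// Hypothetical helper: only pay the cost of sorting the rendered stat names
// when the caller explicitly asks for it via a "?sort" query parameter.
// The substring check is a deliberate simplification of real query parsing.
void maybeSortStatNames(absl::string_view path_and_query, std::vector<std::string>& names) {
  if (absl::StrContains(path_and_query, "sort")) {
    std::sort(names.begin(), names.end());
  }
}
```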

cocotyty commented 3 years ago

> We can also have an option ?sort for those that want it and are willing to wait for it and consume more memory.

Agreed 👍

daixiang0 commented 3 years ago

/assign

repokitteh-read-only[bot] commented 3 years ago

daixiang0 is not allowed to assign users.

cocotyty commented 3 years ago

@daixiang0 thank you for your work

jmarantz commented 3 years ago

@daixiang0 any progress being made on this issue? When do you think you'll be able to work on it?

daixiang0 commented 3 years ago

Hi @jmarantz, I am working on it and will create a PR soon.

alyssawilk commented 3 years ago

I'm not clear on how the suggested interface is going to help with using too much memory. Aren't we going to have the same problem with slow clients, where we may add chunks faster than they're sent downstream? I think we'd want an API for adding body data and getting watermark notifications, at which point it starts to look a lot like a filter. I wonder if we should use some of the existing APIs: instead of sending data we addEncodedData and subscribe to watermark callbacks, or if we think that's too complicated we have a generic AdminFilter and wrap those calls more simply. Thoughts?

alyssawilk commented 3 years ago

cc @mattklein123 @snowp

jmarantz commented 3 years ago

Yeah I wasn't exactly sure what the best mechanism is to stream the data out. But right now 100% of the serialized bytes are always buffered in the admin-handler. Then they may also be buffered in the networking layer for a slow client.

If the mechanism we use to send bytes to the network has a bounded buffer itself and (say) blocks if that's full, then I think the strategy proposed here will reduce memory as expected. Another alternative is that we do an async call to a network write() for each chunk and don't write any more until we get a done callback. As you can tell, I haven't worked much at this layer in Envoy.

If the only available API is non-blocking and infinite-buffering, then you are right: this strategy will not be able to fully bound memory. I'm hoping that's not the case. In that scenario I don't know how to prevent a 1G response from consuming 1G of memory independent of client read-rate.
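
To make that second alternative concrete, here is a generic, hedged sketch of the "one chunk in flight, write more only on the done callback" pattern; PacedWriter and its callbacks are hypothetical and not Envoy interfaces:

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical sketch of "one chunk in flight at a time": next_ yields the
// next serialized chunk on demand (empty string == done), and send_ pushes a
// chunk to the network and invokes a completion callback once it has drained.
class PacedWriter : public std::enable_shared_from_this<PacedWriter> {
public:
  using NextChunk = std::function<std::string()>;
  using AsyncSend = std::function<void(std::string, std::function<void()>)>;

  PacedWriter(NextChunk next, AsyncSend send)
      : next_(std::move(next)), send_(std::move(send)) {}

  void start() { writeNext(); }

private:
  void writeNext() {
    std::string chunk = next_();
    if (chunk.empty()) {
      return; // Response complete; nothing is left buffered in the handler.
    }
    // Hold only this chunk; produce the next one when the send completes.
    auto self = shared_from_this();
    send_(std::move(chunk), [self]() { self->writeNext(); });
  }

  NextChunk next_;
  AsyncSend send_;
};
```

Used this way (constructed via std::make_shared and kicked off with start()), the handler holds roughly one chunk at a time, plus whatever the transport layer itself buffers.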

mattklein123 commented 3 years ago

I agree with @alyssawilk that at a certain point we are just re-implementing all of the flow control behavior we already have, so it would be nice to just reuse those watermark APIs.

jmarantz commented 3 years ago

This makes sense as a better way to go. It's probably worth rethinking this PR in terms of just better usage of the existing filter APIs rather than adding the Chunker API. @daixiang0 sorry for the detour.

PTAL at admin_filter.cc/.h, the other usages of filters, and the existing relationship of admin.cc to admin_filter.cc.

At a high level I think the proposed strategy still makes sense: break up the output streaming into smaller chunks sent to the network directly from the filter. It may require quite a bit of refactoring to get there.

ggreenway commented 3 years ago

One other consideration: since this is for stats, the values should all be latched as close together in time as possible. So even if we don't generate the entire text representation at once, we would need to copy all the values (maybe a big vector of pair<StatName, value>?)

Also, for the use case of prometheus stats, I think the most common use case is a fast client on a fast network. My feeling is that the best tradeoff we'll get is generating them similarly to how it's done now, but calling send() for each chunk so the OS can push them out and, if send() gives backpressure, just accumulating in a Buffer like we do now.
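
As a rough illustration of that shape (using hypothetical helper types, not Envoy's real Stats interfaces): latch every value into a flat snapshot first so the values are coherent in time, then render and send it a chunk at a time:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

#include "absl/strings/str_cat.h"

// Hypothetical latched snapshot: stat name plus its value captured at a
// single point in time, so all emitted values are (nearly) coherent.
using Snapshot = std::vector<std::pair<std::string, uint64_t>>;

// Render the snapshot incrementally: each call produces roughly at most
// max_chunk_bytes of text, so the full exposition never sits in memory.
std::string renderNextChunk(const Snapshot& snapshot, size_t& index, size_t max_chunk_bytes) {
  std::string chunk;
  while (index < snapshot.size() && chunk.size() < max_chunk_bytes) {
    const auto& [name, value] = snapshot[index++];
    absl::StrAppend(&chunk, name, " ", value, "\n");
  }
  return chunk; // Empty once every latched value has been rendered.
}
```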

daixiang0 commented 3 years ago

@jmarantz never mind, it is still a nice idea: splitting into chunks rather than holding the whole data stream would save memory.

@alyssawilk @mattklein123 thanks for the input, I am really not familiar with those areas and have learned a lot from you!

ME-ON1 commented 3 years ago

Hi, is this issue still open for contribution?

daixiang0 commented 3 years ago

/unassign

@ME-ON1 you can go ahead, I am busy with other things.

jmarantz commented 2 years ago

@ggreenway

jmarantz commented 1 year ago

@rulex123 are you thinking of working on this? There definitely seems to be some interest in it, but no one has stepped up to take it on yet. I'm happy to advise, review, etc.

rulex123 commented 1 year ago

@jmarantz happy to help out with this. I see you already started work on a streaming implementation for stats... do you have a rough plan in mind already? What should I focus on?

jmarantz commented 1 year ago

Oh great! Do you want to ping me on Slack tomorrow and we can chat? I can outline what needs to be done.

rulex123 commented 1 year ago

> Oh great! Do you want to ping me on Slack tomorrow and we can chat? I can outline what needs to be done.

I can ping you on Monday and we can catch up then, if that works for you?

jmarantz commented 1 year ago

I think https://github.com/envoyproxy/envoy/pull/24998 had closed this issue, but it had to be reverted, so leaving this open.

jmarantz commented 1 year ago

The problem with the algorithm used to accelerate rendering non-Prometheus stats (StatsRequest) is that the tag-extraction groups are composed of scalar stats that cross scope boundaries. For example, cluster_default_total_match_count exists in scope cluster.forward_cluster and scope cluster.local_hugo_serve. This is from @zirain's test yaml referenced in https://github.com/envoyproxy/envoy/issues/27173.

To fix this, rather than flushing out the stats after each block of scopes with the same name, we need to hold onto them until we are certain there can be no more additions to the group in alphabetically later scopes. E.g. we may have to hold onto all the name-groups with a "cluster." prefix until we get past all the clusters. I'm not certain this will be enough, as tag-extraction regexes are independent of the scope naming hierarchy. Also, cluster sub-scopes may be the majority of all stats, so if we have to build a map containing the counters from all scopes beginning with "cluster.", that is not much cheaper than building the map for all stats.

I still think there's room to improve the current impl, but it will diverge more from the algorithm used to stream scalar stats for the other stats formats; I'm not sure how large the improvement can be, but it is risk-free and an easy win.

Another way to look at this problem is that when we discover a new tag-extracted name, we could then find all other stats with the same tag-extracted name across all scopes. The data structures are not optimized for that, but maybe they could be if we want to invest effort and memory in this. One idea would be to have an alternate AllocatorImpl which could be instantiated instead of the current one in deployments with lots of stats and the desire to render them in Prometheus (e.g. via command-line flag). The alternate AllocatorImpl could add a map<tag-extracted-name, set<Counter*>> (not keeping a strong reference) and would be responsible for clearing Counter* from the sets when those counters are freed from the allocator (e.g. during an xDS update dropping a cluster), and for cleaning up map entries pointing to empty sets. We would not want to pay the memory or overhead of keeping this accurate unless Prometheus was used, so it would need to be behind a flag.

Another way to achieve that would be for the allocator to offer delete-hook and create-hook APIs, so that a Prometheus context could maintain its TagExtractedName -> Set maps. That might be an easier incremental change compared to injecting an alternate Allocator.
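
A hedged sketch of that hook-based variant; every name here is hypothetical, since the allocator does not currently expose create/delete hooks:

```cpp
#include <string>

#include "absl/container/flat_hash_map.h"
#include "absl/container/flat_hash_set.h"

// Stand-in for Stats::Counter; only the identity matters for the index.
class Counter;

// Hypothetical index a Prometheus context could maintain if the allocator
// offered create/delete hooks. Pointers are weak references: the delete hook
// must remove an entry before the counter is freed.
class TagExtractedIndex {
public:
  // Called from a (hypothetical) allocator create-hook.
  void onCounterCreated(const std::string& tag_extracted_name, Counter* counter) {
    counters_by_name_[tag_extracted_name].insert(counter);
  }

  // Called from a (hypothetical) allocator delete-hook, e.g. when an xDS
  // update drops a cluster and its stats are freed.
  void onCounterDeleted(const std::string& tag_extracted_name, Counter* counter) {
    auto it = counters_by_name_.find(tag_extracted_name);
    if (it == counters_by_name_.end()) {
      return;
    }
    it->second.erase(counter);
    if (it->second.empty()) {
      counters_by_name_.erase(it); // Clean up entries pointing to empty sets.
    }
  }

  // All live counters sharing a tag-extracted name, or nullptr if none.
  const absl::flat_hash_set<Counter*>* find(const std::string& tag_extracted_name) const {
    auto it = counters_by_name_.find(tag_extracted_name);
    return it == counters_by_name_.end() ? nullptr : &it->second;
  }

private:
  absl::flat_hash_map<std::string, absl::flat_hash_set<Counter*>> counters_by_name_;
};
```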

Either way, this would make the Prometheus formatter faster, I think, but not necessarily take less memory, since we'd be maintaining structures all the time similar to the ones we create temporarily now. Since this memory burst happens every 5 seconds, maybe it's better to maintain the structures than to periodically rebuild them.