cocotyty opened this issue 3 years ago (status: Open)
Maybe using HTTP chunked responses could fix this problem.
@alyssawilk @lizan
cc @jmarantz
Agreed -- I think we should stream out stats output from admin rather than buffering all of it. This would be a good project for someone that was comfortable with HTTP and C++ but doesn't need to have excessively deep knowledge of Envoy.
I have looked at this deeply in the past and could advise, but don't have time to tackle myself.
I want to do this, please share more details, thanks @jmarantz
The main challenge here is that it's probably simplest to make all the admin handlers stream.
The admin system defines a handler API in ./include/envoy/server/admin.h
#define MAKE_ADMIN_HANDLER(X) \
[this](absl::string_view path_and_query, Http::ResponseHeaderMap& response_headers, \
Buffer::Instance& data, Server::AdminStream& admin_stream) -> Http::Code { \
return X(path_and_query, response_headers, data, admin_stream); \
}
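For reference, a handler conforming to that prototype looks roughly like this (the handler name and body below are illustrative, not copied from the tree):

```cpp
// Illustrative only: a handler shaped like the prototype MAKE_ADMIN_HANDLER
// expects. Note that the entire response body is accumulated into `response`
// before anything is written to the network, which is the buffering this
// issue is about.
Http::Code AdminImpl::handlerExample(absl::string_view /*path_and_query*/,
                                     Http::ResponseHeaderMap& /*response_headers*/,
                                     Buffer::Instance& response,
                                     Server::AdminStream& /*admin_stream*/) {
  response.add("stat_a: 1\n");
  response.add("stat_b: 2\n");
  return Http::Code::OK;
}
```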
I think we should define a new class Chunker to manage the chunking process, replace Buffer::Instance& in that handler prototype with a Chunker&, and allow handlers to report an error code to the chunker rather than returning it.
class Chunker {
public:
void add(absl::string_view);
void reportError(Http::Code code, absl::string_view error_text); // Cannot call this after add()
};
Most of the handlers call only Buffer::add so we just need to change the data type. I haven't done a thorough survey of all the handlers so this needs to be verified. You can implement Chunker initially to just buffer up all the data and it can be sent all at once, just to bound the size of the first PR.
Then, you can do a second PR to get streaming working in source/server/admin/admin_filter.cc and source/server/admin/admin.cc. The simplest thing is just to stream out a chunk every time add() is called, but some handlers call add() with very small chunks and that wouldn't be efficient. Instead you could add some intelligence to Chunker so it buffers up to some sensible fixed size before emitting that to the network.
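To make that second-PR behavior concrete, here is a minimal sketch of a buffering Chunker; the ChunkSink callback and the flush threshold are assumptions for illustration, not existing Envoy APIs:

```cpp
// Sketch only: buffers added fragments and hands a chunk to a caller-supplied
// sink whenever the buffered size crosses a threshold. The sink stands in for
// whatever mechanism admin_filter.cc ends up using to write to the network.
class Chunker {
public:
  using ChunkSink = std::function<void(Buffer::Instance& chunk)>;

  Chunker(ChunkSink sink, uint64_t flush_threshold_bytes = 16 * 1024)
      : sink_(std::move(sink)), flush_threshold_bytes_(flush_threshold_bytes) {}

  void add(absl::string_view fragment) {
    data_added_ = true;
    buffer_.add(fragment);
    if (buffer_.length() >= flush_threshold_bytes_) {
      flush();
    }
  }

  void reportError(Http::Code code, absl::string_view error_text) {
    ASSERT(!data_added_); // Cannot call this after add().
    code_ = code;
    buffer_.add(error_text);
  }

  // Called by the admin filter after the handler finishes (and possibly on
  // drain events) to push out whatever is left.
  void flush() {
    if (buffer_.length() > 0) {
      sink_(buffer_); // Expected to drain buffer_.
    }
  }

  Http::Code code() const { return code_; }

private:
  ChunkSink sink_;
  const uint64_t flush_threshold_bytes_;
  Buffer::OwnedImpl buffer_;
  Http::Code code_{Http::Code::OK};
  bool data_added_{false};
};
```

Most handlers would then only ever see add(); the filter would own the sink and decide when flush() actually maps to a network write.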
There may already be a class in Envoy that does chunking like this. It's worth asking on envoy-dev in Slack.
Also note that some of the constraints of the prometheus format require collecting all the metrics to sort them. But I think peak memory use could be reduced by quite a bit by streaming out the textual format of the stats; that doesn't need to be collected as it is now. See comments: https://github.com/envoyproxy/envoy/blob/ca4ca983195f337a840618b45b9b33cda82c1224/source/server/admin/prometheus_stats.cc#L73-L88
Agreed. And keeping all the stats of a given type in a vector of pointers is not free, but it's a fraction of the cost of all the serialized strings and values held in memory to be emitted all at once.
https://prometheus.io/docs/instrumenting/writing_clientlibs/#exposition
I think we don't need to sort them.
Agreed; I'm fine with not sorting if it reduces resources used.
We can also have an option ?sort for those that want it and are willing to wait for it and consume more memory.
Agreed 👍
/assign
daixiang0 is not allowed to assign users.
@daixiang0 thank you for your work
@daixiang0 any progress being made on this issue? When do you think you'll be able to work on it?
Hi @jmarantz, I am working on it and will create a PR soon.
I'm not clear on how the interface suggested is going to help with taking up too much memory. Aren't we going to have the same problem with slow clients, where we may add chunks faster than they're sent downstream? I think we'd want an API for adding body data, and getting watermark notifications, at which point it starts to look a lot like a filter. I wonder if we should use some of the existing APIs; instead of sending data we addEncodedData and subscribe to watermark callbacks, or if we think that's too complicated we have a generic AdminFilter and wrapper those calls more simply. Thoughts?
cc @mattklein123 @snowp
Yeah I wasn't exactly sure what the best mechanism is to stream the data out. But right now 100% of the serialized bytes are always buffered in the admin-handler. Then they may also be buffered in the networking layer for a slow client.
If the mechanism we use to send bytes to the network has a bounded buffer itself and (say) blocks if that's full, then I think the strategy proposed here will reduce memory as expected. Another alternative is that we do an async call to a network write() for each chunk and don't write any more until we get a done callback. As you can tell I haven't worked much at this layer in Envoy.
If the only available API is non-blocking and infinite-buffering, then you are right: this strategy will not be able to fully bound memory. I'm hoping that's not the case. In that scenario I don't know how to prevent a 1G response from consuming 1G of memory independent of client read-rate.
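In rough pseudo-C++, the write-then-wait variant would look something like this; AsyncWriter and its done callback are hypothetical, not an existing Envoy interface:

```cpp
// Hypothetical interface: writes one chunk and invokes on_done once the
// network layer has actually consumed it, so at most one chunk per request
// is buffered regardless of how slowly the client reads.
class AsyncWriter {
public:
  virtual ~AsyncWriter() = default;
  virtual void write(Buffer::Instance& chunk, std::function<void()> on_done) = 0;
};

// Produce the next chunk only after the previous one has been written.
// next_chunk fills the buffer and returns false when there is nothing left.
void streamChunks(AsyncWriter& writer,
                  std::function<bool(Buffer::Instance&)> next_chunk) {
  auto chunk = std::make_shared<Buffer::OwnedImpl>();
  if (!next_chunk(*chunk)) {
    return; // Done; nothing more to send.
  }
  writer.write(*chunk, [&writer, next_chunk, chunk]() {
    streamChunks(writer, next_chunk); // Continue once the network drains it.
  });
}
```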
I agree with @alyssawilk that at a certain point we are just re-implementing all of the flow control behavior we already have, so it would be nice to just reuse those watermark APIs.
This makes sense as a better way to go. It's probably worth rethinking this PR in terms of just better usage of the existing filter APIs rather than adding the Chunker API. @daixiang0 sorry for the detour.
PTAL at admin_filter.cc/.h, the other usages of filters, and the existing relationship of admin.cc to admin_filter.cc.
At a high level I think the proposed strategy still makes sense: break up the output streaming into smaller chunks sent to the network directly from the filter. It may require quite a bit of refactoring to get there.
One other consideration: since this is for stats, the values should all be latched as close together in time as possible. So even if we don't generate the entire text representation at once, we would need to copy all the values (maybe a big vector of pair<StatName, value>?).
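For example, something like this sketch, which latches counters via the store-wide accessor (names and the exact container are illustrative):

```cpp
// Sketch: snapshot all counter values at one instant, then render and stream
// the text lazily from this snapshot. A StatName plus a uint64_t is far
// smaller than the serialized text line for the same stat.
std::vector<std::pair<Stats::StatName, uint64_t>> latchCounters(Stats::Store& store) {
  std::vector<std::pair<Stats::StatName, uint64_t>> snapshot;
  for (const Stats::CounterSharedPtr& counter : store.counters()) {
    snapshot.emplace_back(counter->statName(), counter->value());
  }
  return snapshot;
}
```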
Also, for the use case of prometheus stats, I think the most common case is a fast client on a fast network. My feeling is that the best tradeoff we'll get is generating them similarly to how it's done now, but calling send() for each chunk so the OS can push them out, and if send() gives backpressure, just accumulating in a Buffer like we do now.
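Roughly like this, where try_send is a stand-in for whatever non-blocking write API turns out to be available, not a real Envoy call:

```cpp
// Sketch: push each chunk toward the OS as it is produced. try_send drains as
// much of `pending` as the socket will take and returns false on backpressure.
void emitChunk(Buffer::Instance& pending, absl::string_view chunk,
               const std::function<bool(Buffer::Instance&)>& try_send) {
  pending.add(chunk);
  if (!try_send(pending)) {
    // Slow client: whatever the socket did not take stays in `pending`,
    // bounded only by the response size, just like today.
  }
}
```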
@jmarantz never mind, it is still a nice idea: splitting into chunks rather than holding the whole data stream would save memory.
@alyssawilk @mattklein123 thanks for the input; I am really not familiar with these areas and learned a lot from you!
Hi, is this issue still open for contribution?
/unassign
@ME-ON1 you can go ahead, I am busy on other things.
@ggreenway
@rulex123 are you thinking of working on this? There definitely seems to be some interest in it, but no one has stepped up to take it on yet. I'm happy to advise, review, etc.
@jmarantz happy to help out with this. I see you already started work on a streaming implementation for stats.. do you have a rough plan in mind already? What should I focus on?
Oh great! Do you want to ping me on Slack tomorrow and we can chat? I can outline what needs to be done.
I can ping you on Monday and we can catch up then, if that works for you?
I think https://github.com/envoyproxy/envoy/pull/24998 closed this issue, but it had to be reverted, so leaving this open.
The problem with the algorithm used to accelerate rendering non-prometheus stats (StatsRequest) is that the tag-extraction groups are composed of scalar stats that cross scope boundaries. For example, cluster_default_total_match_count exists in scope cluster.forward_cluster and in scope cluster.local_hugo_serve. This is from @zirain's test yaml referenced in https://github.com/envoyproxy/envoy/issues/27173.
To fix this, rather than flushing out the stats after each block of scopes with the same name, we need to hold onto them until we are certain there can be no more additions to the group in alphabetically later scopes. E.g. we may have to hold onto all the name-groups with a "cluster." prefix until we get past all the clusters. I'm not certain this will be enough, as tag-extraction regexes are independent of the scope naming hierarchy. Also, cluster sub-scopes may be the majority of all stats, so if we have to build a map containing the counters from all scopes beginning with cluster., that is not much cheaper than building the map for all stats.
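A rough sketch of that hold-and-flush idea, assuming scopes are visited in sorted order; the types and the render callback are illustrative, not existing code:

```cpp
// Sketch: group counters by tag-extracted name, remember the scope prefix
// ("cluster." etc.) that contributed each group, and only emit a group once
// iteration has moved past every scope sharing that prefix. As noted above,
// tag-extraction regexes are not tied to the scope hierarchy, so this
// heuristic may still miss cases.
struct PendingGroup {
  std::string scope_prefix;                      // e.g. "cluster."
  std::vector<Stats::CounterSharedPtr> counters; // members seen so far
};
std::map<std::string, PendingGroup> pending_groups; // keyed by tag-extracted name

void flushCompletedGroups(
    const std::string& current_scope_name,
    const std::function<void(absl::string_view, const PendingGroup&)>& render) {
  for (auto it = pending_groups.begin(); it != pending_groups.end();) {
    const bool past_prefix =
        it->second.scope_prefix < current_scope_name &&
        !absl::StartsWith(current_scope_name, it->second.scope_prefix);
    if (past_prefix) {
      render(it->first, it->second); // No later scope can extend this group.
      it = pending_groups.erase(it);
    } else {
      ++it;
    }
  }
}
```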
I still think there's room to improve the current impl, though it will diverge more from the algorithm used to stream scalar stats for the other stats formats. I'm not sure how large the improvement can be, but it is risk-free and an easy win.
Another way to look at this problem is that when we discover a new tag-extracted name, we can then find all other stats with the same tag-extracted name across all scopes. The data structures are not optimized for that, but maybe they could be if we want to invest effort and memory in this. One idea would be to have an alternate AllocatorImpl which could be instantiated instead of the current one in deployments with lots of stats and the desire to render them in Prometheus (e.g. via a command-line flag). An alternate AllocatorImpl could add a map<tag-extracted-name, set<Counter*>> (not keeping a strong reference) and would be responsible for clearing each Counter* from the sets when those counters are freed from the allocator (e.g. during an xDS update dropping a cluster), and for cleaning up map entries pointing to empty sets. We would not want to pay the memory or overhead of keeping this accurate unless Prometheus was used, so it would need to be behind a flag.
Another way to achieve that would be for the allocator to offer delete-hook and create-hook APIs, so that a Prometheus context could maintain its own TagExtractedName -> set<Counter*> map itself.
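Purely as a sketch of the shape that reverse index could take (none of these hooks exist today; the class and method names are made up):

```cpp
// Hypothetical: maintained by a Prometheus-aware allocator variant (or via
// create/delete hooks). Maps tag-extracted name -> raw Counter pointers,
// holding no strong references; entries are removed as counters are freed.
class PrometheusIndex {
public:
  void onCounterCreated(Stats::Counter& counter) {
    index_[counter.tagExtractedName()].insert(&counter);
  }

  void onCounterFreed(Stats::Counter& counter) {
    auto it = index_.find(counter.tagExtractedName());
    if (it == index_.end()) {
      return;
    }
    it->second.erase(&counter);
    if (it->second.empty()) {
      index_.erase(it); // Don't keep map entries pointing at empty sets.
    }
  }

  // Everything currently sharing a tag-extracted name, across all scopes.
  const absl::flat_hash_set<Stats::Counter*>*
  countersFor(const std::string& tag_extracted_name) const {
    auto it = index_.find(tag_extracted_name);
    return it == index_.end() ? nullptr : &it->second;
  }

private:
  absl::flat_hash_map<std::string, absl::flat_hash_set<Stats::Counter*>> index_;
};
```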
Either way, this would make the prometheus formatter faster, I think, but not necessarily take less memory, since we'd be maintaining similar structures all the time as the ones we are creating temporarily now. Since this memory burst happens every 5 seconds maybe it's better to maintain the structures than periodically rebuild them.
Title: Prometheus stats handler used too much memory.