Also, many of the regexes in source/common/config/well_known_names.cc have required prefixes (e.g. ^http, ^mongo, ^cluster). We could structure them as a map<string,vector<regex>> and apply only the regexes whose key matches the first token of the stat name. The regexes without a required prefix could go into a separate list that is applied universally.
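For illustration, such a prefix-keyed structure could look roughly like the sketch below. This is hypothetical, not Envoy code: the type and field names (`TagExtractorIndex`, `by_first_token`, `unprefixed`) are made up, and it assumes '.'-separated stat names with the prefix as the first token.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch only -- not Envoy code. Regexes with a required leading
// token (e.g. "http", "mongo", "cluster") are grouped under that token, so a
// given stat name only pays for the regexes that could possibly apply to it.
struct TagExtractorIndex {
  std::map<std::string, std::vector<std::regex>> by_first_token;
  std::vector<std::regex> unprefixed;  // no known prefix; applied to everything.

  void apply(const std::string& stat_name) const {
    const std::string first_token = stat_name.substr(0, stat_name.find('.'));
    const auto it = by_first_token.find(first_token);
    if (it != by_first_token.end()) {
      for (const std::regex& re : it->second) {
        std::smatch match;
        if (std::regex_search(stat_name, match, re)) {
          // ... extract the tag from `match` ...
        }
      }
    }
    for (const std::regex& re : unprefixed) {
      std::smatch match;
      if (std::regex_search(stat_name, match, re)) {
        // ... extract the tag from `match` ...
      }
    }
  }
};
```

The point is just that only the regexes registered under the stat name's first token run, plus the unavoidable unprefixed list.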
@jmarantz I would recommend following up with @mrice32, who wrote all this code. If it looks like the regex parsing is now a large part of startup time, I'm guessing there are some low-hanging-fruit optimizations we could consider, such as the ones you mention.
will do, thanks!
We could definitely apply the map<string,vector<regex>> optimization.
It's also possible to turn tags off by setting use_all_default_tags to false and not including any custom tags. However, maybe people would prefer a more obvious way to disable tags? Turning off the regexes might also be a good way to measure the maximum perf improvement we could get by fixing this.
As for the canonical representation, I agree that the best solution would be to embed the structure of the tags in the instantiation of each stat if this is becoming a performance bottleneck. Let's chat offline about this @jmarantz.
I pushed an experimental commit that helps quite a bit.
https://github.com/jmarantz/envoy/commit/b1f25cf3dde8de38651df18e5f0a0021f37b126f
Filter regex lookups by prescanning the regex input string and quickly ruling out inputs that can't match.
This results in a ~3x speedup on the 10k cluster test from https://github.com/envoyproxy/envoy/issues/2063. It still takes too long to start up, though, IMO.
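The prescan idea is roughly what the sketch below shows. This is a simplified, hypothetical illustration rather than the code in the commit above; the struct and field names are made up. Each extractor precomputes a literal prefix and/or required substring, and the std::regex is skipped entirely when the input can't possibly match (the "prefix-discard" / "embedded-discard" cases in the table further down).

```cpp
#include <regex>
#include <string>

// Simplified sketch of the prescan idea (illustrative names; not the code in
// the commit above): precompute literals that any matching stat name must
// start with or contain, and consult them before running std::regex_search.
struct PrescannedExtractor {
  std::string required_prefix;  // e.g. "cluster."; empty if none is known.
  std::string required_substr;  // e.g. "grpc";     empty if none is known.
  std::regex regex;

  bool extract(const std::string& stat_name, std::smatch& match) const {
    if (!required_prefix.empty() &&
        stat_name.compare(0, required_prefix.size(), required_prefix) != 0) {
      return false;  // "prefix-discard": the name can't possibly match.
    }
    if (!required_substr.empty() &&
        stat_name.find(required_substr) == std::string::npos) {
      return false;  // "embedded-discard": a required substring is absent.
    }
    return std::regex_search(stat_name, match, regex);  // only pay here.
  }
};
```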
That commit shouldn't really help at all; I didn't convert to a map because I was unsure of what to enter into the map. I could do an upper-bound/lower-bound type search though. But the optimization in this commit is more of a peephole tweak and I'm surprised it helped at all. I wonder if this speaks to how suboptimal std::regex is for our application.
I have two higher-level thoughts:
And some data from the 1k-cluster example from bug 2063, run with callgrind, with exit(0) injected prior to the listener loop in main_common.cc:
- ScopeImpl::counter called 61k times
- ScopeImpl::gauge called 10k times
- Stats::extractTag called 1.56M times
- std::__detail::Executor<...some regex thing...> called 11M times
Note that in this run, extractTag skips the regex match if the prefix doesn't match.
I've done some more digging by instrumenting a map and some timers to see which regexes take a lot of time. The table below measures the microseconds consumed parsing specific regexes. The action column reflects whether the regex matched ("success"), was executed and failed ("miss"), or was skipped by the hack I described above ("prefix-discard", "embedded-discard").
These are semi-sorted by hand :)
| microseconds consumed | action | name |
| ---: | --- | --- |
| 2993691 | miss | envoy.grpc_bridge_method |
| 2920885 | miss | cipher_suite |
| 2903371 | miss | envoy.grpc_bridge_service |
| 1042328 | miss | envoy.http_conn_manager_prefix |
| 337193 | success | envoy.cluster_name |
| 182114 | miss | envoy.response_code |
| 134511 | miss | envoy.response_code_class |
| 35603 | embedded-discard | envoy.response_code |
| 35804 | embedded-discard | envoy.response_code_class |
| 28141 | prefix-discard | envoy.dyanmo_table |
| 28698 | prefix-discard | envoy.dynamo_operation |
| 34606 | prefix-discard | envoy.dynamo_partition_id |
| 33137 | prefix-discard | envoy.fault_downstream_cluster |
| 34684 | prefix-discard | envoy.http_user_agent |
| 28725 | prefix-discard | envoy.listener_address |
| 33139 | prefix-discard | envoy.mongo_callsite |
| 32518 | prefix-discard | envoy.mongo_cmd |
| 32695 | prefix-discard | envoy.mongo_collection |
| 31976 | prefix-discard | envoy.mongo_prefix |
| 33859 | prefix-discard | envoy.ratelimit_prefix |
| 28443 | prefix-discard | envoy.ssl_cipher |
| 29455 | prefix-discard | envoy.tcp_prefix |
| 33770 | prefix-discard | envoy.virtual_cluster |
| 33987 | prefix-discard | envoy.virtual_host |
| 33601 | prefix-discard | envoy.clientssl_prefix |
| 508 | miss | envoy.dyanmo_table |
| 508 | miss | envoy.dynamo_operation |
| 520 | miss | envoy.dynamo_partition_id |
| 490 | miss | envoy.fault_downstream_cluster |
| 511 | miss | envoy.http_user_agent |
| 104 | miss | envoy.ssl_cipher |
| 152 | success | envoy.http_conn_manager_prefix |
| 62 | miss | envoy.listener_address |
| 10 | prefix-discard | envoy.cluster_name |
| 10 | prefix-discard | cipher_suite |
| 4 | prefix-discard | envoy.grpc_bridge_method |
| 6 | prefix-discard | envoy.grpc_bridge_service |
| 9 | success | envoy.listener_address |
| 14 | success | envoy.response_code_class |
| 5 | miss | envoy.cluster_name |
My suspicion is that these slow regex lookups are due to regex backtracking (background: https://regular-expressions.mobi/catastrophic.html?wlr=1) and may be resolvable by making the regexes pickier, e.g. using "." less and using more specific character classes that exclude known separators.
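As a hedged illustration of what "pickier" could mean here (the patterns below are hypothetical, not the ones in well_known_names.cc): replacing ".*" with a character class that excludes the "." separator leaves the engine almost no room to backtrack, which matters most on inputs that do not match.

```cpp
#include <regex>
#include <string>

int main() {
  const std::string name = "cluster.some_cluster_name.upstream_rq_200";

  // Backtracking-prone: ".*" can swallow the whole string and then be forced
  // to give characters back until the rest of the pattern can match.
  std::regex greedy("^cluster\\.(.*)\\.upstream_rq_(\\d{3})$");

  // Pickier: "[^.]+" cannot cross the '.' separators, so the engine has far
  // fewer positions to retry before giving up.
  std::regex picky("^cluster\\.([^.]+)\\.upstream_rq_(\\d{3})$");

  std::smatch match;
  std::regex_search(name, match, greedy);  // both match this input; the cost
  std::regex_search(name, match, picky);   // difference shows up on misses.
  return 0;
}
```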
Progress update: I have some changes I'm testing now that get the startup time for the 10k-cluster config from 30 seconds to about 5 seconds, and I think I can trim that to about 2-3 seconds.
The changes entail:
- `$1`: throw away one token
- `$c1`: capture one token
- `$*`: throw away many tokens greedily
- `$c*`: capture many tokens greedily
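To make that concrete, here is a minimal, hypothetical interpreter for such token patterns. It is not the code on the prescan-regex branch: it assumes "."-separated stat names, assumes a `$*`/`$c*` element is followed only by single-token elements, and the helper names (`split`, `matchTokens`) are made up.

```cpp
#include <string>
#include <vector>

// Split a stat name like "cluster.foo.upstream_rq_200" into tokens on '.'.
std::vector<std::string> split(const std::string& s, char sep) {
  std::vector<std::string> out;
  size_t start = 0;
  while (true) {
    const size_t pos = s.find(sep, start);
    out.push_back(s.substr(start, pos - start));
    if (pos == std::string::npos) break;
    start = pos + 1;
  }
  return out;
}

// Hypothetical matcher for token patterns made of literals, "$1", "$c1",
// "$*", and "$c*". Simplifying assumption: a "$*"/"$c*" element is followed
// only by single-token elements, so "greedy" means "take all but the tail".
bool matchTokens(const std::vector<std::string>& pattern,
                 const std::vector<std::string>& tokens,
                 std::vector<std::string>* captures) {
  size_t t = 0;
  for (size_t p = 0; p < pattern.size(); ++p) {
    const std::string& elt = pattern[p];
    if (elt == "$*" || elt == "$c*") {
      const size_t tail = pattern.size() - p - 1;  // single-token elements left.
      if (tokens.size() < t + tail) return false;
      const size_t end = tokens.size() - tail;     // greedy: grab all but tail.
      if (elt == "$c*") {
        std::string joined;
        for (size_t i = t; i < end; ++i) {
          if (i > t) joined += '.';
          joined += tokens[i];
        }
        captures->push_back(joined);
      }
      t = end;
    } else {
      if (t >= tokens.size()) return false;
      if (elt == "$c1") captures->push_back(tokens[t]);
      else if (elt != "$1" && elt != tokens[t]) return false;
      ++t;
    }
  }
  return t == tokens.size();
}
```

With something of this shape, a pattern like `{"cluster", "$c1", "$*"}` can capture the cluster name out of `cluster.foo.upstream_rq_200` with no regex machinery at all.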
I have gotten this to start up in 2.5 seconds, and I think this is probably where I'll pause for now, though there is still opportunity for improvement. Even the changes I have so far will really require multiple PRs, which will be forthcoming as I clean them up.
The aggregated (and not-ready-for-review) changes are on https://github.com/jmarantz/envoy/commits/prescan-regex
@mattklein123 @htuch @alyssawilk can one of you assign this to me?
I have a more detailed implementation plan in a 2-page Google doc.
@jmarantz is there remaining work on this or should we resolve as complete (for now)?
Per discussion in the other perf issue, closing this as fixed. We can open new issues as needed. Awesome work!!!
Description: See https://github.com/envoyproxy/envoy/issues/2063 and the 10k testcase. In that testcase, with https://github.com/envoyproxy/envoy/pull/2358 applied, the 30-second startup time is dominated by regex code associated with tags.
I did a little debugging to try to understand the tags, and I still don't. But my naive observation is that we are applying a bunch of regexes to stat names to do this transform (in Counter& ThreadLocalStoreImpl::ScopeImpl::counter()).
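For readers without the code open, the per-stat work being described has roughly the following shape. This is a simplified sketch, not the actual ThreadLocalStoreImpl code; `Tag` and `extractTags` here are illustrative stand-ins.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Rough shape of the per-stat work described above (a simplification, not the
// actual Envoy code): every counter()/gauge() call runs every configured
// tag-extraction regex against the flat stat name.
struct Tag {
  std::string name;
  std::string value;
};

std::string extractTags(const std::string& stat_name,
                        const std::vector<std::pair<std::string, std::regex>>& extractors,
                        std::vector<Tag>* tags) {
  std::string tag_extracted_name = stat_name;
  for (const auto& [tag_name, re] : extractors) {
    std::smatch match;
    if (std::regex_search(tag_extracted_name, match, re) && match.size() > 1) {
      tags->push_back(Tag{tag_name, match[1].str()});
      // Strip the matched portion so what remains is the "tag-extracted" name,
      // mirroring the flatten-then-rediscover pattern described above.
      tag_extracted_name = match.prefix().str() + match.suffix().str();
    }
  }
  return tag_extracted_name;
}
```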
My feeling is that this tag extraction is a structural transformation that could be done much faster if the name structure were retained in the stat structures, rather than flattened into a string and then rediscovered. But I don't have enough of a feel for whether this alone would make startup fast, or whether the effort to fix it would be justified by instant startups on huge numbers of clusters.