mattklein123 commented 6 years ago

Needed for:

Stats backends that expect precomputed histograms
Per-host latency outlier detection
Per-host latency load balancing

mattklein123 commented 6 years ago

I think @jmarantz might pick this up. I'm going to brain dump my research on this so far into this issue in the next few days for further discussion.

jmarantz commented 6 years ago

sgtm

mattklein123 commented 6 years ago

I've done a bunch of research and here is what I have figured out so far. At a high level we have the following decision points:

Shared memory histogram or not shared memory histogram.
Type of histogram to use. From my research the types of histograms roughly fall into three categories. a) Fixed bucket widths. b) Log linear bucket widths. c) HDR.

Each of the above decision points have very different tradeoffs in terms of memory usage, accuracy, ease of multi-threading, and general ease of programming.

Here is my current thinking on the above, but @jmarantz it would be great if you want to own a more complete analysis of the tradeoffs and what we should think about doing.

Shared memory vs. not

The pros of a shared memory implementation are that we would have correct histograms across hot restart. The cons of the shared memory implementation is that we are unlikely to be able to use existing libraries and the programming model will be more complicated, especially if we want to do something like HDR. My current opinion is that the extra complexity of shared memory for histogram data is not worth it and we should just assume that histogram values in the old process that is shutting down will get dropped. I don't think this will really matter that much from an operational perspective.

Type of histogram and programming model

In general, the accuracy of fixed bucket width histograms are going to be fairly poor, unless we think about somehow tuning the buckets on a per metric basis. From my research it seems it would make more sense to use log linear or HDR. Of these, the libraries that seem most promising are:

log linear: https://github.com/circonus-labs/libcircllhist
HDR: https://github.com/HdrHistogram/HdrHistogram_c

Both of the above implementations support merging. This means that we could easily have per-thread histograms which then get aggregated during a flush pass (say every 60s). In the case of log linear, per thread aggregation is demonstrated here: https://github.com/circonus-labs/libcircmetrics

Note that t-digest has also been mentioned to me (e.g., https://github.com/tdunning/t-digest) but I don't know much about it).

I don't yet have a strong opinion on HDR vs. log linear. It seems that HDR will require substantially more memory per histogram, but yield much more accurate results. To reduce memory usage, we could ask the user of the histogram to set reasonable bounds and help estimate the memory usage. The log linear implementation I think would be far less accurate at larger values than HDR. I think a more complete analysis needs to be done the tradeoffs.

Envoy programming model

Today, histograms are driven off the stats interfaces as well as sometimes as part of the stat creation macros: https://github.com/envoyproxy/envoy/blob/master/include/envoy/stats/stats_macros.h. It's quite possible that in order to make this work well, the user of each histogram will need to set bounds (especially if we use HDR). This may mean that we need to remove the histogram option from the stat macros and have users create histograms manually off of a stat scope. The reality is that this isn't that big of a deal as there aren't that many histograms in the code. But this needs more analysis in terms of how the user will interact with histograms and potentially set bounds.

Stat flushing model

Thought also needs to be given to where histogram computation lives. E.g., do we compute for all stat sinks even though statsd does not want the computations? What type of aggregation/rotation interval do we use? If it fixed for all histograms? (E.g., rotate every 60s). Is it configurable per histogram? Do we eventually want multiple aggregation intervals? (E.g, 60s, 5m, 15m). In general simpler is always better but it's worth thinking through all of this for future use cases.

@jmarantz I'm sure I forgot something above but hopefully this is enough to get started and we can discuss further as you dig in.

mattklein123 commented 6 years ago

Another implementation that we use at Lyft IIRC: https://github.com/statsite/statsite/blob/master/src/cm_quantile.h. cc @theatrus ^

theatrus commented 6 years ago

The t-digest implementation in Veneus (https://github.com/stripe/veneur#approximate-histograms / https://github.com/stripe/veneur/tree/master/tdigest ) is probably the best reference - the Go is fairly transliterable to C/C++ (I'll even volunteer).

mattklein123 commented 6 years ago

@theatrus can you also give us your opinion on the tradeoffs above re: the different algorithms? Especially as it relates to Envoy?

jmarantz commented 6 years ago

If there is some value in histogram continuity across hot-restarts, is it an option to consider keeping the active structure in process-local memory and periodically serialize to shm or other persistent store, deserializing on hot restart? Also a SIGTERM/SIGHUP triggered attempt to save the latest state may make sense.

mattklein123 commented 6 years ago

If there is some value in histogram continuity across hot-restarts, is it an option to consider keeping the active structure in process-local memory and periodically serialize to shm or other persistent store, deserializing on hot restart? Also a SIGTERM/SIGHUP triggered attempt to save the latest state may make sense.

Having continuity would be optimal, but intuition tells me that the operational benefits are not worth the extra code complexity. As you point out, there are options beyond actually doing a histogram in shared memory. For example, the new process could ask for the old process histograms serialized via a hot restart RPC. Personally, I would recommend that we decide on all the other things first (programming model, which algorithm to use, etc.) and then make the call on shared memory/consistent or not?

theatrus commented 6 years ago

I'd also caution against a histogram in shared memory - histograms are inherently more expensive to update, and it can become a point of contention at high update rates. Of course, the same goes for merging-at-read, since locking is required on the thread local structures as reading can't be done atomically. One solution I've used in the past here is an A/B slot system: it costs 2x the memory, but allows soft transitions between one actively-updating and one actively-read-then-zeroed histogram structure. This is likely premature optimization without profile data to back it up.

As for the algorithms, I'd pass on log-linear. Its simple but is likely to have worse performance at the extreme bounds of quantiles (I don't have data on this exact implementation on hand to back up these claims, just experience from previous approximate libraries using a similar construct). I'd focus on HDR and T-Digest - I can spend some cycles using some production data to find error bounds.

postwait commented 6 years ago

Lossless, mergeable histograms will allow robust statistics afterwards. Trying t-digest, replacement, or quantiles is suboptimal. Histograms a undoubtedly more expensive than counters, but we're talking about <10ns to update then, so if you're tracking req/resp latencies it is completely satisfactory -- if you're tracking system calls or hot loops, then perhaps untenable.

Log-linear has precisely quantifiable error that allows robust statistics to be performed. That error exists in all approaches, but not all approaches provide easy mathematical reasoning. HDR is acceptable (as it is effectively a generalization of log-linear. The key there is that the HDR would need to be configurable such that an operator can set parameters that allow for robust statistical reasoning post-collection. T-Digest has compromises that are completely unnecessary.

HDR is really hard to get right (parameters/settings) and by customizing it you lose mergeability, which is a tragic compromise. Log-linear is a solid compromise that (in the linked implementation) guarantees 0-5% error margin on the bottom of the log range and 0-0.502% on the top of log range. The error itself (which is reasonable for large sample sets) is fixed and can be cascaded forward through statistical calculations (like quantile).

I'd recommend storing a cumulative llhist (or configurable HDR) and possibly a observer-resettable histogram as well. Though the latter isn't strictly necessary it can make for simpler monitoring integration.

postwait commented 6 years ago

For context that I realize might be missing given review of the conversation...

I all of these methods are fairly good a producing quantiles (noting some largers issues in T-digest around variable arrival rates). The point is that we'd like histograms, not quantiles. Quantiles are some arbitrary (though useful) statistical function that can be calculated on a set (and a histogram), but they are quite limited. We don't even use quantiles for service levels most of the time. Modern analytics tools need the histograms themselves to do more robust post-collection analysis.

My interest in the internal representation here is not because I want quantiles calculated a certain way, but rather that I wish it to not be internal. I would like to get (non-opaquely) the full distribution represented in the histogram out of envoy so that more complex modeling and richer analytics (and alerting) can be performed.

ramaraochavali commented 6 years ago

@jmarantz are you working on this? if yes, any idea on when we can expect this?

jmarantz commented 6 years ago

Actually I'm not; I'm looking more at perf related stuff.

I do have a candidate shared-memory histogram implementation I've been used in a different project, but I don't think it meets the dynamic bucket sizing requirements stated above. It does, however, handle concurrent updates from multiple processes, and has a fair amount of mileage on it. https://github.com/apache/incubator-pagespeed-mod/blob/c1f4d063cdc219d08fe07babcd617e2c8c1a2069/pagespeed/kernel/sharedmem/shared_mem_statistics.h#L96. However it might have mutex contention performance issues depending on the frequency of update. Alternative strategies of per-thread buffering and periodic merging may work better.

There are alternative implementations linked in the bug that (I think) have more of the desired functionality, but no shared-memory support.

ramaraochavali commented 6 years ago

@jmarantz ok.thanks for the details. those are very useful.

postwait commented 6 years ago

Also, https://github.com/circonus-labs/libcircmetrics has a latency optimized, contention-coping histogram implementation in it as well. We use it all over the place for high-frequency histogram collection in multi-threaded apps. It basically just stored metrics (including as histograms) and exposes them. It's thin with few dependencies. All histogram insertions are sub 10ns.

mattklein123 commented 6 years ago

I'm also favoring using libcircmetrics per @postwait at this point. I don't think we need shared memory support for the histograms. @ramaraochavali if you do want to work on this feature let's do a proper design doc before coding (gdoc which lays out goals, proposed changes, etc.) as this is a pretty complicated one.

ramaraochavali commented 6 years ago

@mattklein123 i looked at it briefly yesterday and felt that shared memory histogram is very complicated and it is not worth the complexity here. if i get to work on this, sure, will create a design doc.

@postwait are you talking of STATS_TYPE_HISTOGRAM_FAST when you mentioned libcircmetrics has latency optimized histogram? Do you have sample usage patterns?

postwait commented 6 years ago

The STATS_TYPE_HISTOGRAM and STATS_TYPE_HISTOGRAM_FAST are both good options. The "fast" version has a more stable runtime as the overall bin count grows. To be clear, they are both fast and they both have CPU fanout to reduce cross-thread contention.

stats_recorder_t *rec = stats_recorder_alloc();
stats_ns_t *global = stats_recorder_global_ns(rec);
stats_handle_t *hist = stats_register(global, "latency", STATS_TYPE_HISTOGRAM_FAST);

uint64_t start_ns = getns();
// do something
stats_set_hist_intscale(hist, getns() - start_ns, -9, 1);

You can also add a double using stats_set_hist(...), but if you already have a measurement in integer form using the _intscale variant will avoid a lot of unnecessary floating point math.

It supposed nested namespaces for better organization. The API is both small and unsophisticated.

To get them out in JSON, just use stats_recorder_output_json

ramaraochavali commented 6 years ago

@postwait thank you so much for the usage patterns. So is getting the output using stats_recorder_output_json is the only way to get the output? I looked at the api and looks like there is no other way. Am I missing some thing here? or is there a simpler way to get the histogram values out? Another question is there a way to configure buckets?

postwait commented 6 years ago

@ramaraochavali in the current API, yes. It does that to ensure a minimal locking pattern on the underlying tree. It would be quite simple to expose specific histogram operations as well. What did you have in mind? Getting at a raw histogram isn't often that helpful. As one of the maintainers of libcircmetrics, I'm happy to introduce APIs to solve commonly needed use cases.

As for configuring the bins, libcircmetrics uses (libcircllhist) fixed log-linear histograms, so there is no configuration required. This has the distinct advantage of ensuring the property of mergeability across independently collected histograms: one of its main advantages.

ramaraochavali commented 6 years ago

@postwait awesome. thanks. I am still going through the details to figure out if any additional apis are required. I am definitely not suggesting to get the raw histogram but was thinking about an api like getSnapshot which would return a structure with different bin values some thing similar to this. Does it make sense? Please do not make any change now but will let you know if some thing of that sort is required.

postwait commented 6 years ago

@ramaraochavali Oh! Those aren't histograms.. those are "math on histograms." The super frustrating part of those APIs is that once you get that single (or small set of) math operation, the rich histogram data is gone. It would certainly be possible to expose those statistical calculations on the underlying histograms; we just need to agree on what the API should look like. If you want a copy of the histogram object on which you can do math and expose that math, that should be quite easy... a getCopy that returns a histogram_t on which you can inquire percentiles and means, etc. using the underlying libcircllhist API:

https://github.com/circonus-labs/libcircllhist/blob/master/src/circllhist.h#L186-L194

To be fair, I am suggesting we get the raw histograms; this is why the default exposure is the whole histogram in JSON. Therein lies all the value. If you can expose the raw histogram, I can calculate anything I wish and I'm not limited to a few percentiles and a mean and variance.

ramaraochavali commented 6 years ago

@postwait Great. Thanks for the clarification. Looks like getting a copy of it using getCopy and doing the statistical math ourselves seems to be an option. Will definitely let you know if we need those higher level statistical calculation apis.

ramaraochavali commented 6 years ago

@mattklein123 Given the shared histogram complexity and our general inclination to libcircmetrics at this point, how about the following approach?

Manage these histograms at Sink. We could have these histograms defined at Sink layer.
Whenever, onHistogramComplete on Sink is called, we push the value to corresponding histogram.
We can introduce flushHistograms method on Sink, that can be called along with the other two flush methods based on stats flush interval.
In the flushHistograms we can get the histogram values out and push it to the sink.
For Statsd sink, since they do not need these, it can be a no-op. Only downside of this seems to be that we have histograms at two places (original place) and in sink. Given that the number of histograms are very few, is this a big deal? This seems to be simple to implement. Not sure if I am missing any thing obvious here. WDYT?

For /admin/stats, we will have to see if there is a way to expose the histogram data from sinks?

ramaraochavali commented 6 years ago

@mattklein123 one thing I see as a problem with above approach is it expects at least one sink to exist for use cases like

Per-host latency outlier detection
Per-host latency load balancing. Will think about how to address these, but want a very high level opinion on the above.

mattklein123 commented 6 years ago

@ramaraochavali I'm mostly on leave for the next 3 weeks or so. I will get back to you but it will be delayed as I need to find time to think through this in depth. Thank you for your patience.

ramaraochavali commented 6 years ago

@mattklein123 no worries. I am aware that you are on vacation, so not expecting responses and am perfectly ok with delayed responses :-)

ramaraochavali commented 6 years ago

@mattklein123 when you get back to work from leave, can you please take a look at the above approach and see if it makes sense?

mattklein123 commented 6 years ago

Yes @ramaraochavali I will do this next week.

mattklein123 commented 6 years ago

Sorry for the delay here:

Manage these histograms at Sink. We could have these histograms defined at Sink layer.

Yes agreed.

Whenever, onHistogramComplete on Sink is called, we push the value to corresponding histogram.

Yes, but we should do this without locking. There should be a histogram per thread. We can then merge the histograms during the flush interval.

We can introduce flushHistograms method on Sink, that can be called along with the other two flush methods based on stats flush interval.

Yes, during flushing we should latch all of the thread local histograms and collect them. This will require locking, and for perf reasons potentially atomic swaps of two histograms (active/collecting).

In the flushHistograms we can get the histogram values out and push it to the sink.

Yes, we will need to decide on how we summarize the histograms and ultimately expose them. Which percentiles will we show in the stats output?

For Statsd sink, since they do not need these, it can be a no-op.

Yup.

Only downside of this seems to be that we have histograms at two places (original place) and in sink. Given that the number of histograms are very few, is this a big deal?

I don't follow. What do you mean?

For /admin/stats, we will have to see if there is a way to expose the histogram data from sinks?

During flush, we will collect histograms into a summarized histogram object, which can then be printed on demand, for data during the last flush cycle.

Per-host latency outlier detection

For use cases like this, I would suggest not worrying about it for now. We can have additional histogram implementations in the locations that require them outside of the stat sink architecture.

postwait commented 6 years ago

On Mar 14, 2018 7:32 PM, "Matt Klein" notifications@github.com wrote:

In the flushHistograms we can get the histogram values out and push it to the sink.

Yes, we will need to decide on how we summarize the histograms and ultimately expose them. Which percentiles will we show in the stats output?

Can we also expose the whole histogram in the stats? There are tsdbs that support histogram data types, so for those, reducing the rich histogram to a few percentiles would be a shame.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/1965#issuecomment-373209481, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUfhBQ-P1A2QpUjPR2g0BH404H0vt7Yks5teaiQgaJpZM4QKJXH .

mattklein123 commented 6 years ago

Can we also expose the whole histogram in the stats? There are tsdbs that support histogram data types, so for those, reducing the rich histogram to a few percentiles would be a shame.

Yup sounds reasonable to me. Can you help with design of that? We should potentially also treat that is an additional incremental feature.

ramaraochavali commented 6 years ago

@mattklein123 thanks. great. On

Only downside of this seems to be that we have histograms at two places (original place) and in sink. Given that the number of histograms are very few, is this a big deal?

I was saying that the code that actually uses the histogram will have the Histogram metric object that holds the raw single value and the "Actual" histogram would be at sink layer.. we can ignore this. not a real concern.

On the atomic updates and concurrently updating the histograms, I thought per @postwait 's comment above " https://github.com/circonus-labs/libcircmetrics has a latency optimized, contention-coping histogram implementation in it as well" - we do not need per thread histogram and merging logic, we could rely on libcircmetrics for that. Is that not correct? @postwait what do you think here?

And on stats sending the rich histogram as is, I think the Promoheus Metric Proto has a facility to send rich histogram data. We can do that for metric service. In /admin/stats we could have summarized view to start with and add presenting rich histogram later incrementally. Other stat sinks we could add as needed.

ramaraochavali commented 6 years ago

@mattklein123

Which percentiles will we show in the stats output?

I think we should start with P50, P75, P90, P99 and P999 - what do you think?

postwait commented 6 years ago

p0, p25, and p100 are also very useful (min/max). with those (and the p75) you can do quartile-based candlestick graphs.

On Thu, Mar 15, 2018 at 10:15 AM, Rama Chavali notifications@github.com wrote:

@mattklein123 https://github.com/mattklein123

Which percentiles will we show in the stats output? I think we should start with P50, P75, P90, P99 and P999 - what do you think?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/1965#issuecomment-373390154, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUfhJD0phLJGAZIVu6_V8jdnuYEsshMks5tendugaJpZM4QKJXH .

mattklein123 commented 6 years ago

On the atomic updates and concurrently updating the histograms, I thought per @postwait 's comment above " https://github.com/circonus-labs/libcircmetrics has a latency optimized, contention-coping histogram implementation in it as well" - we do not need per thread histogram and merging logic, we could rely on libcircmetrics for that. Is that not correct? @postwait what do you think here?

Based on when I last looked at the code, I don't think we should use https://github.com/circonus-labs/libcircmetrics as I don't think the TLS model used there is going to be easily compatible with Envoy TLS, and some of the lock-free stuff that I would eventually like to do here. I would use https://github.com/circonus-labs/libcircllhist directly, and use libcircmetrics as an implementation guide. Each Envoy TLS sink should have an independent histogram, and then merging can be done as part of flushing. Unfortunately, how I would probably end up doing this from a locking/synchronization perspective is a bit complicated and I'm not sure I can fully describe it without going and doing it. I would just get it working using whatever locks you need, and we can optimize it later.

I think we should start with P50, P75, P90, P99 and P999 - what do you think?

Agreed, and let's also add P0, P25, P100 per @postwait.

postwait commented 6 years ago

I agree with this. If you already have a strong TLS model within the application related to performance, you're likely to better accommodate the primitives in libcircllhist better. envoy managing a thread-local histogram and later combining them (with an atomic-pivot for example), would be an excellent implementation.

On Thu, Mar 15, 2018 at 12:15 PM, Matt Klein notifications@github.com wrote:

On the atomic updates and concurrently updating the histograms, I thought per @postwait https://github.com/postwait 's comment above " https://github.com/circonus-labs/libcircmetrics has a latency optimized, contention-coping histogram implementation in it as well" - we do not need per thread histogram and merging logic, we could rely on libcircmetrics for that. Is that not correct? @postwait https://github.com/postwait what do you think here?

Based on when I last looked at the code, I don't think we should use https://github.com/circonus-labs/libcircmetrics as I don't think the TLS model used there is going to be easily compatible with Envoy TLS, and some of the lock-free stuff that I would eventually like to do here. I would use https://github.com/circonus-labs/libcircllhist directly, and use libcircmetrics as an implementation guide. Each Envoy TLS sink should have an independent histogram, and then merging can be done as part of flushing. Unfortunately, how I would probably end up doing this from a locking/synchronization perspective is a bit complicated and I'm not sure I can fully describe it without going and doing it. I would just get it working using whatever locks you need, and we can optimize it later.

I think we should start with P50, P75, P90, P99 and P999 - what do you think?

Agreed, and let's also add P0, P25, P100 per @postwait https://github.com/postwait.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/1965#issuecomment-373433639, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUfhKmDnuYOdJZp6naVOYcc-rfs6lLiks5tepOxgaJpZM4QKJXH .

ramaraochavali commented 6 years ago

@mattklein123 so looks like we have agreement on the approach and the issues. I will start putting up PR and let us how it goes. I definitely need your help in the locking aspect. We can discuss on that in the PR.

@postwait thanks for your inputs here. As we discussed earlier on this issue, would you be OK to provide those statistical computation functions/apis for p90 etc on the histograms here https://github.com/circonus-labs/libcircllhist? We could add those functions at envoy but I am thinking would it be better/usable for others if we add it in your lib?

postwait commented 6 years ago

It is there, last function in the header.

//! Approiximate n quantiles of all values stored in the histogram //! \param q_in array of quantiles to comute //! \param nq length of quantile array //! \param q_out pre-allocated array where results shall be written to API_EXPORT(int) hist_approx_quantile(const histogram_t , double q_in, int nq, double *q_out);

q_in would just be [0, 0.25, 0.5, 0.75, 0.95, 0.99, 0.999, 1] and nq would be 8.

On Mar 15, 2018 11:30 PM, "Rama Chavali" notifications@github.com wrote:

@mattklein123 https://github.com/mattklein123 so looks like we have agreement on the approach and the issues. I will start putting up PR and let us how it goes. I definitely may need your help in the locking aspect. We can discuss on that in the PR.

@postwait https://github.com/postwait thanks for your inputs here. As we discussed earlier on this issue, would you be OK to provide those statistical computation functions/apis for p90 etc on the histograms here https://github.com/circonus-labs/libcircllhist? We could add those functions at envoy but I am thinking would it be better/usable for others if we add it in your lib?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/1965#issuecomment-373593097, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUfhPNsPcbjOdD1luzI5G5vPSwCJACmks5tezHjgaJpZM4QKJXH .

ramaraochavali commented 6 years ago

@postwait thanks.

mattklein123 commented 6 years ago

@ramaraochavali yup SGTM. Feel free to get started. If you are stuck on anything reach out and I can look at a WIP branch. Thank you!

postwait commented 6 years ago

You'd want to make sure that libcircllhist is insalled. That appears to be a linker error... make sure that you extern "C" the include (as it is a C library).

Also make sure you're linking against it... -lcircllhist

On Sun, Mar 18, 2018 at 10:00 AM, Rama Chavali notifications@github.com wrote:

@postwait https://github.com/postwait i am trying to include this and build with envoy and I am getting the following error in Mac Sierra Xcode 9.

Undefined symbols for architecture x86_64: "hist_alloc()", referenced from:


Envoy::Stats::Metrics::MetricsServiceSink::MetricsServiceSink(std::__1::
shared_ptrEnvoy::Stats::Metrics::GrpcMetricsStreamer const&) in
libmetrics_service_grpc_lib.lo(grpc_metrics_service_impl.o)
"hist_insert(histogram*, double, unsigned long long)", referenced from:
Envoy::Stats::Metrics::MetricsServiceSink::onHistogramComplete(Envoy::Stats::Histogram
const&, unsigned long long) in libmetrics_service_grpc_lib.
lo(grpc_metrics_service_impl.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see
invocation)
Target //source/exe:envoy-static failed to build

Do you know if it compiles on Mac and if any special flags need to be passed?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/envoyproxy/envoy/issues/1965#issuecomment-374001425>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAUfhG5Kgo5ahmRfamH1FjqTff8k3uiAks5tfmhmgaJpZM4QKJXH>
.

ramaraochavali commented 6 years ago

@postwait thanks, was able to get the envoy build going with libcircllhist, integrated with one latency histogram , ran few requests and was able to extract the stats out of it. Will reach out to you if I need any further help.

ramaraochavali commented 6 years ago

@postwait one small help. I am trying to merge the histograms collected per thread in to single histogram with the following code

 for (auto const& histogram : histograms_) {
       const auto histogram_value = merged_histograms_.emplace(std::make_pair(histogram.first, hist_alloc()));
       histogram_t *hist_array[1];
       hist_array[0] = histogram.second;
       hist_accumulate(histogram_value.first->second, hist_array, 1);
       hist_clear(histogram.second); //clear the data after merge
    }

The above loop runs for all the threads. Merged histogram is declared as std::map<std::string, histogram_t*> histograms_; After few runs, the code fails with the assertion Assertion failed: hist_bucket_cmp(tgt->bvs[tgtidx].bucket, src->bvs[srcidx].bucket) == 0), function internal_bucket_accum, file external/com_github_circonus_labs_libcircllhist/src/circllhist i.e. at line and some times it fails at line. Can you please review the above accumulation code and see if some thing is wrong?

ramaraochavali commented 6 years ago

@postwait never mind. Looks like I found the issue after looking at the code. The target histogram should always be fresh. We can not merge in to existing histo. Is that correct? After I made that change, i stopped getting those asserts.

postwait commented 6 years ago

You should be able to accumulate onto an existing histogram.

More than likely it is a locking problem. This histogram manipulation is all thread-unsafe and the called has to manage exclusive access.

On Fri, Mar 23, 2018 at 6:41 AM, Rama Chavali notifications@github.com wrote:

@postwait https://github.com/postwait never mind. Looks like I found the issue after looking at the code. The target histogram should always be fresh. We can not merge in to existing histo. Is that correct? After I made that change, i stopped getting those asserts.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/1965#issuecomment-375619040, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUfhGe6BQGRWGMZ4hEox8GnkB9nlJYcks5thNFYgaJpZM4QKJXH .

ramaraochavali commented 6 years ago

ok. Thanks. I have not implemented the locking properly at all. Will try that and let you know after that.

mattklein123 commented 6 years ago

@ramaraochavali as promised, here is a breakdown of exactly how I would recommend implementing this. It's complicated, but it will get us directly to the optimal solution, so I think it's worth just doing it now.

Overview

Per discussion in the current PR, I thought about it, and I think we should just keep running histograms regardless of the stat sinks in use. If perf really becomes an issue we can potentially add a server config option to disable it later. For now, let's just do it everywhere as I think it makes everything that follows simpler.
Conveniently, @mrice32 already made Histogram into a first class proper object in the store. Thus, histograms are created, stored in the central store, have TLS representation already, etc. This makes our work much simpler.
Because we want the histograms to be available everywhere, and because Histogram is already a first class thing, we can just use the existing Histrogram interface and implementation to achieve our goals (with some modifications).
Our primary goal is to make https://github.com/envoyproxy/envoy/blob/master/source/common/stats/thread_local_store.h histogram storage aware. Ultimately, we should also make IsolatedStoreImpl implement stored histograms, but I don't think there is any real reason to do that right now so I would put in a TODO. (In the future we can use this on a per-host basis potentially for per-host latency, etc., but there are locking issues to consider there so I would not bother for now).

Locking and flushing overview for the thread local stat store

Here is how we should handle locking and flushing at a high level:

Each Histogram implementation will have 2 parts. 1) A "main" thread parent and 2) a per-thread collector.
Worker threads will write into their per-thread collector, without needing any locking when doing so.
The flush process works like so:
- The main thread server starts the flush process by posting a message to every worker which tells the worker to swap its "active" histogram with its "backup" histogram. (Thus each TLS histogram has 2 histograms it makes use of, swapping back and forth).
- When all workers have done this, a message is posted back to the main thread to "continue the flush."
- On the main thread, we know now that there is no possibility of any worker writing into the "backup" histogram.
- The main thread can now go through all histograms, collect them across each worker, and create final total and summary views for each histogram.
- These total views and summary views can then be flushed along with the counters and gauges. Additionally, the summary views can be stored for later access via the admin handlers in between flush intervals.

Implementation details

We will start in ThreadLocalStoreImpl::ScopeImpl::histogram(). Notice how currently this function mimics creation for counters and gauges. Basically, it tries to find a histogram in a TLS slot, and if not allocates one in the central store, and then stores it in TLS storage.
We will modify ThreadLocalStoreImpl::ScopeImpl::histogram() to do the following:
- Create a HistogramParentImpl and a HistogramPerThreadImpl
- HistogramPerThreadImpl is stored in the TLS storage. HistogramParentImpl is stored in main thread storage
- If TLS storage already has a HistogramPerThreadImpl just return it
- If not, see if a HistogramParentImpl needs to be created. If so, create one.
- Now, create a HistogramPerThreadImpl, add it to a list stored by the parent, and store it in TLS, then return it to the caller.
- Note that we are not doing anything with shared memory here. We just assume that histograms are only local to the process.
Thus, with the changes to ThreadLocalStoreImpl::ScopeImpl::histogram() we now are setup to have a central histogram implementation with per-thread collection capabilities.
At this point the implementation of the rest of the system is relatively straightforward. I will outline each of the steps at a high level.
HistogramPerThreadImpl has 2 histograms, as I mentioned before. An "active" and a "backup". Values are always written into "active", but during flushing "active" and "backup" will swap. (Basically write to active through a pointer or index).
HistogramParentImpl has a list of all the per-thread histograms which we can iterate through when we need to. This will need to be done under central store lock, but this will not block high throughput TLS writes into the histograms.
We will then modify InstanceImpl::flushStats() to perform our 2-stage flush operation.
- When this routine is called, call a method in the store called something like collectAllHistograms(). Internally, the store will post a message to all workers, which will perform the histogram swap, and when the last worker has checked in, posts back to the main thread. (You might consider adding some function to do this to the ThreadLocal system directly).
- When the main thread post returns, the lock can be re-acquired and all of the histograms can be collected as previously discussed.
- Then the existing contents of InstanceImpl::flushStats() can be run.
- This gets us the best situation for everyone. Sinks that don't care about the summary/complete histograms can just ignore them. Additionally, the data is now available if Envoy is hit for one of the stats admin endpoints.

Conclusion

Alright, that's a wall of text, and I'm probably missing something. But that's roughly how I would implement it and I think we should track towards the above. I'm happy to help out with things as you start to implement. @mrice32 can you also take a look to see if you see any issues with the above plan? cc @jmarantz also in case you have any thoughts.

Thank you!

ramaraochavali commented 6 years ago

@mattklein123 will try this approach and let you know how it hoes. Will reach out to you if I need some help.

mrice32 commented 6 years ago

@mattklein123 that design looks great to me - should be more than enough to work towards. @ramaraochavali, would you mind adding me on your PR(s) related to this? I'd be interested to take a look :). Also, cc @trabetti, who's working on #2736 and may want to take advantage of this once it's done.

envoyproxy / envoy

stats: add built-in log linear histogram support #1965

Shared memory vs. not

Type of histogram and programming model

Envoy programming model

Stat flushing model

Overview

Locking and flushing overview for the thread local stat store

Implementation details

Conclusion