apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Offer a relative-error sketch alternative to ContinuousSample #2078

Open alexmiller-apple opened 5 years ago

alexmiller-apple commented 5 years ago

ContinuousSample currently builds a population of a defined size, and then keeps a random sample of that size drawn from all elements it has seen, such that each element has an equal chance of being included in the sample. Quantiles are calculated by sorting the array and looking at the value stored at a specific index (i.e. the 99th element gives the 99th percentile in a 100-element array).
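
For reference, a minimal sketch of that reservoir-sampling idea (illustrative only, not the actual ContinuousSample code; names are made up):

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative reservoir sampler: keeps a fixed-size, uniformly random
// sample of everything it has seen, in the spirit of ContinuousSample.
class ReservoirSample {
public:
	explicit ReservoirSample(size_t sampleSize) : sampleSize(sampleSize) {}

	void addSample(double value) {
		++total;
		if (samples.size() < sampleSize) {
			samples.push_back(value);
		} else {
			// Keep the new value with probability sampleSize / total,
			// replacing a uniformly random existing entry.
			std::uniform_int_distribution<uint64_t> dist(0, total - 1);
			uint64_t slot = dist(rng);
			if (slot < sampleSize)
				samples[slot] = value;
		}
	}

	// Rank-based quantile: sort the sample and index into it.
	double quantile(double q) const {
		if (samples.empty())
			return 0;
		std::vector<double> sorted = samples;
		std::sort(sorted.begin(), sorted.end());
		size_t idx = std::min(sorted.size() - 1, (size_t)(q * sorted.size()));
		return sorted[idx];
	}

private:
	size_t sampleSize;
	uint64_t total = 0;
	std::vector<double> samples;
	std::mt19937_64 rng{ std::random_device{}() };
};
```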

This seems to be the simplest version of what is referred to as a rank-error quantile sketch in the literature, as the error bounds they promise are on the rank at which a value will appear rather than on the value itself. For answering quantiles near the middle of a distribution, rank-error sketches have reasonably low error. For quantiles near 0 or 1, like reporting 99th percentile tail latency, they allow a reasonably large error bound on the value.

Also from VLDB '19 is DDSketch, a paper on a relative-error quantile sketch. Relative-error sketches offer an error bound on the value rather than on its rank, which makes tail latency quantiles more accurate. HDR Histogram is another relative-error sketch. Adopting one of these would improve the error bounds on what we report as our tail latency versus our true tail latency.
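
To make the relative-error idea concrete, here is a rough sketch of DDSketch's core trick as I understand it from the paper, for a fixed relative accuracy alpha (illustrative only, not DataDog's implementation; it ignores non-positive values and the bucket-collapsing logic from the paper):

```cpp
#include <cmath>
#include <cstdint>
#include <map>

// Rough sketch of the DDSketch idea: bucket values on a logarithmic scale so
// that every value in a bucket is within relative error alpha of the
// bucket's representative value.
class RelativeErrorSketch {
public:
	explicit RelativeErrorSketch(double alpha)
	  : gamma((1 + alpha) / (1 - alpha)), logGamma(std::log(gamma)) {}

	void addSample(double value) {
		int bucket = (int)std::ceil(std::log(value) / logGamma);
		++counts[bucket];
		++total;
	}

	// Walk buckets in increasing order until the cumulative count passes
	// rank q * total, then return that bucket's representative value.
	// The representative 2 * gamma^i / (gamma + 1) is within relative
	// error alpha of every value in bucket i, i.e. (gamma^(i-1), gamma^i].
	double quantile(double q) const {
		if (total == 0)
			return 0;
		uint64_t rank = (uint64_t)(q * (total - 1));
		uint64_t seen = 0;
		for (auto& [bucket, count] : counts) {
			seen += count;
			if (seen > rank)
				return 2 * std::pow(gamma, bucket) / (gamma + 1);
		}
		return 0;
	}

private:
	double gamma, logGamma;
	std::map<int, uint64_t> counts;
	uint64_t total = 0;
};
```

With alpha = 0.01, for example, the reported p99 would be within 1% of the true p99 value regardless of how skewed the tail is (modulo the bucket-limit details in the paper), which is the property we'd want for tail latency.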

DataDog released their code for DDSketch, but there isn't yet a C or C++ implementation, though it might not be hard to translate one from the Java implementation. HDR Histogram has a C implementation with a compatible license.

However, discussion with @kaomakino yielded a couple of wise points, as always:

  1. The run-to-run instability of our performance tests will likely have more of an impact on the reported latency figures than the quantile sketch algorithm that's used.
  2. The end result of reporting a 99.99th percentile of 10s vs 20s will probably be the same, w.r.t. paging or warranting investigation.

xumengpanda commented 5 years ago

I think the percentile value from our current performance test is the exact value from all of the measured experiments.

DDSketch sounds like it solves the problem of how to estimate the percentile value from sampled data instead of from all of the data.

If other FDB components, say the storage server (SS), want to get 99th percentile metrics, the algorithm may be useful. (I don't know if SS needs such data though.)

dongxinEric commented 5 years ago

> I think the percentile value from our current performance test is the exact value from all of the measured experiments.

ContinuousSample will randomly sample the stream/sequence once the designated sample buffer is full, then randomly replace an existing entry. See here.

xumengpanda commented 5 years ago

> > I think the percentile value from our current performance test is the exact value from all of the measured experiments.
>
> ContinuousSample will randomly sample the stream/sequence once the designated sample buffer is full, then randomly replace an existing entry. See here.

I see. This is the sampling within one performance test workload, which gives the sampled performance result, e.g., sampled latency.

ajbeamon commented 5 years ago

It looks like we use ContinuousSample a lot in our performance workloads and we also use it in our client-side TransactionMetrics event. We don't report many tail latencies in the client metrics (disregarding max, which seems to be tracked exactly by ContinuousSample) except for 90th and 98th percentile total latency.

I do agree that for many of our performance tests, we likely have other sources of noise that could make a change to the sampling technique not particularly noticeable, but if we have it as a goal to improve our stability we'd probably want to do something like this eventually.

For the client TransactionMetrics, it may also not be a huge immediate win with our currently chosen metrics. If we ever wanted to report different sorts of values or have a mechanism to merge the results of different clients, though, then this could be beneficial.

One other area that may benefit from this is the ability to calculate server-side request latency percentiles. We currently track request counts in different latency bands to support SLO latency targets, but being able to track accurate percentiles could help answer related questions about the health of a cluster (for example, if you wanted your 99th percentile latency to be <10ms, the latency bands could tell you what percentile of requests come in under 10ms, while a sample like this could tell you how close the 99th percentile latency is to 10ms). Beyond standalone accuracy, I'm also not sure how well ContinuousSample can be merged compared to these other sampling approaches, and that would be a requirement for reporting cluster-wide statistics.
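
For what it's worth, bucket-based sketches like DDSketch and HDR Histogram merge by just summing per-bucket counts, which is what cluster-wide reporting would need. A sketch of the idea, reusing the illustrative bucket map from the earlier example:

```cpp
#include <cstdint>
#include <map>

// Merging two bucket->count maps (built with the same alpha/gamma) is just
// summing counts; the merged sketch answers quantiles over the combined
// stream with the same relative-error guarantee. Random reservoirs from
// different clients don't combine this cleanly without re-weighting.
std::map<int, uint64_t> mergeSketches(const std::map<int, uint64_t>& a,
                                      const std::map<int, uint64_t>& b) {
	std::map<int, uint64_t> merged = a;
	for (auto& [bucket, count] : b)
		merged[bucket] += count;
	return merged;
}
```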

Looking at DDSketch, the implementation doesn't seem difficult enough that we would want to avoid it for that reason.