Comcast / jrugged

A Java library of robustness design patterns
Apache License 2.0

PerformanceMonitor 95th/99th percentile stats contain significant error windows #29

Open joercampbell opened 10 years ago

joercampbell commented 10 years ago

The performance monitor as it stands computes quantiles from a sample of the observed measurements, so the system does not have to store ALL the available measurements, only a much smaller number. This can potentially be inaccurate.

joercampbell commented 10 years ago

CUT AND PASTE FROM ANOTHER CONVERSATION:

So after we saw that bit of strangeness with 99th percentiles (i.e.: a variance of about 4 seconds across our 5 nodes, which doesn't sync up at all with what we pulled out of Splunk when calculating the 99th percentile from the entire sample set for each node) I dug into the Jrugged code a bit.

So if I understand what Jrugged is doing, it's roughly attempting to keep a set number of samples from which quantiles are computed. Optionally, this set of samples can be constrained to a sliding window. Once the max number of samples is hit, we'll replace an existing sample, at random, some of the time. It looks like we're trying to make it less likely to replace an existing sample the more samples we've seen in our sliding window (more on that a bit later).

Anyhow, for computing an arbitrary quantile, this seems like it's only valid if we can make some assumptions about the distribution of our samples (it seems like this would only work well if the distribution of latencies were uniform…). Did anyone put some thought into that when this approach was originally written?

In addition, it looks like the way the lifetime quantiles are calculated is particularly strange. Take a look at line 226 here - https://github.com/Comcast/jrugged/blob/master/jrugged-core/src/main/java/org/fishwife/jrugged/SampledQuantile.java. It appears to make it less likely to add a new sample the more total samples we've seen (which makes some sense when it's operating in windowed mode, I guess?). But the effect of that for the lifetime samples is that it becomes increasingly less likely to add a new sample the longer we collect, because maxSamples / effectiveSamples approaches 0 while the randomly generated double ranges between 0 and 1. This means we're actually giving old samples more "weight"? (I could be missing something here, but I doubt that's what was intended...)

To really solve this problem, I think we need a much fancier data structure. Something like this - https://github.com/HdrHistogram/HdrHistogram - is intended to do exactly that. I don't have any experience with it yet, but I'll poke at it…
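For illustration, a minimal sketch of what recording into HdrHistogram could look like (this assumes its stock `Histogram` API; the value bounds and units below are made up for the example):

```java
import org.HdrHistogram.Histogram;

// Minimal HdrHistogram example: every value is recorded into fixed-size
// buckets (no sampling), so quantiles are accurate to the configured
// precision regardless of how many values have been seen.
public class HistogramExample {
    public static void main(String[] args) {
        // track latencies up to 1 hour (in microseconds), 3 significant digits
        Histogram histogram = new Histogram(3_600_000_000L, 3);
        histogram.recordValue(1_250);    // e.g. a 1.25 ms observation
        histogram.recordValue(98_000);   // e.g. a 98 ms observation
        System.out.println(histogram.getValueAtPercentile(99.0));
    }
}
```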

joercampbell commented 10 years ago

CUT AND PASTE FROM ANOTHER CONVERSATION:

Ah, this is probably something I originally wrote. The idea (especially for the lifetime window) is that you need a sample of all the requests that makes each request equally likely to be part of the sample, but without knowing in advance how many requests there will be. If I have already seen N total requests and keep a sample of size k, then including the next request in the sample with probability k/N, and evicting a uniformly chosen existing sample to make room, accomplishes this (this is classic reservoir sampling).
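For reference, a minimal sketch of that idea (Vitter's Algorithm R); the class and field names here are illustrative, not SampledQuantile's actual internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling sketch: after N values have been observed, each one
// is present in the k-element sample with equal probability k/N, without
// ever storing more than k values.
class Reservoir {
    private final int k;                          // max samples to retain
    private final List<Long> samples = new ArrayList<>();
    private long seen = 0;                        // total values observed
    private final Random rng = new Random();

    Reservoir(int k) { this.k = k; }

    void record(long value) {
        seen++;
        if (samples.size() < k) {
            samples.add(value);                   // still room: always keep
        } else if (rng.nextDouble() < (double) k / seen) {
            // keep the new value with probability k/N, evicting a
            // uniformly chosen existing sample
            samples.set(rng.nextInt(k), value);
        }
    }

    List<Long> snapshot() { return new ArrayList<>(samples); }
}
```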

The windowed versions are trickier because some of the samples "expire" when they fall out of the window. I'd have to look at the code to see exactly how that works, but if I recall correctly, you keep the samples sorted by time and check whether the oldest has fallen out of the window. If you then have "room" for the new request, you include it in the sample; if not, you apply the same algorithm as above.
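A rough sketch of that recollection (illustrative only; the real logic lives in SampledQuantile and may differ, for example in how the in-window request count is aged):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Windowed reservoir sketch: samples are kept oldest-first so expired
// ones can be dropped before deciding whether to keep a new value.
class WindowedReservoir {
    private static final class Sample {
        final long timeMillis;
        final long value;
        Sample(long t, long v) { this.timeMillis = t; this.value = v; }
    }

    private final int k;                          // max samples to retain
    private final long windowMillis;              // sliding window length
    private final List<Sample> samples = new ArrayList<>(); // oldest first
    private long seenInWindow = 0;                // crude: a real impl must expire this too
    private final Random rng = new Random();

    WindowedReservoir(int k, long windowMillis) {
        this.k = k;
        this.windowMillis = windowMillis;
    }

    void record(long value, long nowMillis) {
        // drop samples that have fallen out of the window
        while (!samples.isEmpty() && samples.get(0).timeMillis < nowMillis - windowMillis) {
            samples.remove(0);
        }
        seenInWindow++;
        if (samples.size() < k) {
            samples.add(new Sample(nowMillis, value));     // room: always keep
        } else if (rng.nextDouble() < (double) k / seenInWindow) {
            // no room: fall back to the reservoir rule, evicting a random
            // sample and appending the new one (newest, so time order holds)
            samples.remove(rng.nextInt(samples.size()));
            samples.add(new Sample(nowMillis, value));
        }
    }
}
```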

Jon

joercampbell commented 10 years ago

CUT AND PASTE FROM ANOTHER CONVERSATION:

Also, I don't think this relies on the distribution of latencies being uniform. If there is a difference between a 99th percentile calculated from a sample and a 99th percentile calculated over the whole population, I think that's expected to some degree by statistics. I'd also look specifically at the size of the configured sample sets for each window: if the sample size is only 100 requests, the 99th percentile is calculated from just one request. I could easily imagine that varying quite a lot against a full-population 99th percentile.

Jon

joercampbell commented 10 years ago

CUT AND PASTE FROM ANOTHER CONVERSATION:

Hey Jon,

This whole thing is prompted by trying to use the 99th percentiles and getting very bizarre results from a cluster of 5 nodes. The 99th percentiles across the cluster varied wildly (by about 2 seconds), but if we computed them from Splunk, split out by node, the 99th percentiles for each node were exactly the same. Point being, in its current state, the numbers we pulled out of Jrugged were actually worse than nothing: they were so inaccurate they almost led us to make the wrong conclusion about some Codebig latency issues…

One observation - the sample size here is fixed whereas the population can grow without bound, so I'd expect the percentiles that we compute (infer?) to become less accurate the more latencies we see. I'd have to dig up my statistics books here (this is a bit weird, and I'm not sure how to even go about figuring out what a reasonable sample size should be; we're outside the realm of really familiar statistics with confidence intervals here…).

In this case, we have a sample size of 200 and a population of around 50,000. It's entirely possible that that sample size is just too small, but again, since the population can grow without bound for the lifetime metrics and the sample size is fixed, I don't think you can ever pick a reasonable sample.

I also have a suspicion we're not selecting a random sample (but I need to do some math to prove myself wrong :-).

MBL

joercampbell commented 10 years ago

CUT AND PASTE FROM ANOTHER CONVERSATION:

If the sample size is 200, then the 99th percentile will be computed as the lower of the two highest latencies in the sample.

If the total population is 50,000, then let's say that's 10,000 per node. The 99th percentile for a single node will be determined by the lowest of its 100 highest latencies (more or less).
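To make the order-statistic argument concrete, here's a hypothetical helper (not jrugged code) using the convention above, where the estimate is the lowest of the floor((1-p)*n) highest samples:

```java
import java.util.Arrays;

// Nearest-rank percentile over a sample, matching the convention above:
// floor((1-p)*n) samples are expected above the p-quantile, and the
// estimate is the lowest of them.
public class SamplePercentile {
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int tail = (int) Math.floor((1 - p) * n);  // samples above the quantile
        return sorted[n - Math.max(tail, 1)];      // n=200, p=0.99 -> sorted[198]
    }

    public static void main(String[] args) {
        long[] latencies = new long[200];
        for (int i = 0; i < latencies.length; i++) latencies[i] = i + 1;
        // with 200 samples, the 99th percentile is the 2nd-highest value
        System.out.println(percentile(latencies, 0.99));  // prints 199
    }
}
```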

It seems pretty likely to me that a random sample of 2 requests from the 100 highest latencies is apt to be all over the place, especially in the usual "long tail" scenarios we see with our services.

The estimation error of the sampled 99th percentile against the population 99th percentile is definitely going to be affected by the latency distribution of the whole population. I think you're right that if this distribution was uniform between 0 and the maximum observed latency, that they'd be pretty close, generally. However, we usually don't have that kind of distribution.

P(sample 99th percentile falls in the population 99th percentile band) = P(choose 196 samples from the 0-98th percentile range) * P(choose 2 samples from the 98th-99th band) * P(choose 2 samples from the 99th-100th band) (counting one particular arrangement of the 200 samples) =

0.98^196 * 0.01^2 * 0.01^2 = 1.9e-10

Not very likely they'd match exactly.

P(sample 99th percentile < population 99th percentile) = P(choose 199 or 200 requests from the 0-99th percentile) = 0.99^199 + 0.99^200 ≈ 27%

So the sample 99th percentile will likely overestimate the population 99th percentile 3 times out of 4.

If we had 400 samples instead, then this would be: 0.99^400 + 0.99^399 + 0.99^398 + 0.99^397 = 7% chance of underestimating the population 99th percentile

If we had only 100 samples, this would be: 0.99^100 = 36%

So more samples decreases the likelihood you underestimate the 99th percentile with the sample. Arguably, you'd rather overestimate the 99th percentile than underestimate it.
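For anyone who wants to double-check the arithmetic, a throwaway snippet (hypothetical standalone code) that just sums the same power terms as written above:

```java
// Sums the power terms used above: P(underestimate) for a sample of size n
// whose p99 estimate is the lowest of the `tail` highest samples.
public class TailOdds {
    static double underestimateChance(int n, int tail) {
        double p = 0;
        for (int i = 0; i < tail; i++) {
            p += Math.pow(0.99, n - i);   // 0.99^n + 0.99^(n-1) + ...
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.printf("200 samples: %.1f%%%n", 100 * underestimateChance(200, 2)); // ~26.9%
        System.out.printf("400 samples: %.1f%%%n", 100 * underestimateChance(400, 4)); // ~7.3%
        System.out.printf("100 samples: %.1f%%%n", 100 * underestimateChance(100, 1)); // ~36.6%
    }
}
```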

Hope those are useful maths.

Jon

joercampbell commented 10 years ago

Jon, do you think doing something like automatically growing the sampled window (from the default 200) based on the overall lifetime number of samples would be a way to control the percent error introduced? We would have to have a cap so as not to grow it forever, but it might be a good way, with the current code, to make the numbers more realistic.

Just musing on the code.

comcast-jonm commented 10 years ago

I think you get the same result by just setting the sample size to whatever your eventual cap would be, no? If you are willing to hold that many samples, then you might as well just collect that many and get a better estimate. Increasing your sample size should always give you better estimates. If anything, it suggests that perhaps we could bump up the default sample size, although that's always dicey because you don't know how many PerformanceMonitors someone will have running in their application and whether increasing the sample size would put memory pressure on them.

I'd be more inclined to see a digest of this discussion added to the documentation, perhaps with some guidelines for how much memory it costs to add each incremental sample.

joercampbell commented 9 years ago

I am more than willing to add the documentation - but I do think I want to provide users a way to say what percent error they are willing to live with when getting this data. Basically - allow a creator of a performance monitor to indicate how 'wrong' they are willing to be at the 95th, 99th, etc. percentiles, and adjust the windowing to that. Does that make sense?
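One rough way to turn a tolerated error into a window size, assuming the count of samples landing above the true quantile behaves like a Binomial(n, 1-p) variable (a back-of-envelope model, not a rigorous confidence interval):

```java
// Back-of-envelope sizing sketch (hypothetical helper, not jrugged code).
// If the number of samples above the true p-quantile is ~Binomial(n, 1-p),
// its relative standard error is sqrt(p / (n * (1 - p))); solving for n
// gives a rough minimum sample count for a tolerated relative error.
public class SampleSizer {
    static long minSamples(double quantile, double relativeError) {
        double tail = 1.0 - quantile;
        return (long) Math.ceil(quantile / (tail * relativeError * relativeError));
    }

    public static void main(String[] args) {
        // ~10% relative error on the 99th-percentile tail count -> 9,900 samples
        System.out.println(minSamples(0.99, 0.10));
    }
}
```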

jonm commented 9 years ago

I think that does make sense, but I do not necessarily have enough faith in the statistics/probabilities I laid out above to claim we could pick a sample size given a desired confidence interval. I'm sure those calculations exist, but someone else needs to provide them. :) In the meantime, making sure people know that sample window size affects accuracy, as above, seems like a reasonable minimum step while we wait for a smarter method.

joercampbell commented 5 years ago

OK - a pull request is up that may address this.