bmizerany / perks

Effective Computation of Things
BSD 2-Clause "Simplified" License
187 stars 54 forks source link

Low percentile values do not seem to be calculated correctly #20

Open homiak opened 5 years ago

homiak commented 5 years ago

Here are the percentile calculated using this library (first column) and percentiles calculated using github.com/montanaflynn/stats library (second column)

Target percentiles:

calcQuantiles = []float64{0.001, 0.01, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.50, 0.90}

time="2019-05-24T08:53:00Z" level=info msg="Total: 62748 samples"
time="2019-05-24T08:53:00Z" level=info msg="0.1th: 182, 187"
time="2019-05-24T08:53:00Z" level=info msg="1.0th: 4494, 2826"
time="2019-05-24T08:53:00Z" level=info msg="3.0th: 7982, 8024"
time="2019-05-24T08:53:00Z" level=info msg="5.0th: 11528, 11689"
time="2019-05-24T08:53:00Z" level=info msg="10.0th: 15766, 15519"
time="2019-05-24T08:53:00Z" level=info msg="20.0th: 19606, 19680"
time="2019-05-24T08:53:00Z" level=info msg="30.0th: 20909, 20915"
time="2019-05-24T08:53:00Z" level=info msg="40.0th: 21205, 21206"
time="2019-05-24T08:53:00Z" level=info msg="50.0th: 21347, 21350"
time="2019-05-24T08:53:00Z" level=info msg="90.0th: 21470, 21470"

time="2019-05-24T08:54:00Z" level=info msg="Total: 64905 samples"
time="2019-05-24T08:54:00Z" level=info msg="0.1th: 318, 187"
time="2019-05-24T08:54:00Z" level=info msg="1.0th: 4106, 2838"
time="2019-05-24T08:54:00Z" level=info msg="3.0th: 9910, 8039"
time="2019-05-24T08:54:00Z" level=info msg="5.0th: 11709, 11720"
time="2019-05-24T08:54:00Z" level=info msg="10.0th: 15785, 15536"
time="2019-05-24T08:54:00Z" level=info msg="20.0th: 19635, 19685"
time="2019-05-24T08:54:00Z" level=info msg="30.0th: 20962, 20917"
time="2019-05-24T08:54:00Z" level=info msg="40.0th: 21221, 21207"
time="2019-05-24T08:54:00Z" level=info msg="50.0th: 21351, 21351"
time="2019-05-24T08:54:00Z" level=info msg="90.0th: 21466, 21471"

time="2019-05-24T08:54:52Z" level=info msg="Total: 66775 samples"
time="2019-05-24T08:54:52Z" level=info msg="0.1th: 823, 187"
time="2019-05-24T08:54:52Z" level=info msg="1.0th: 3790, 2844"
time="2019-05-24T08:54:52Z" level=info msg="3.0th: 9398, 8046"
time="2019-05-24T08:54:52Z" level=info msg="5.0th: 11577, 11730"
time="2019-05-24T08:54:52Z" level=info msg="10.0th: 15679, 15538"
time="2019-05-24T08:54:52Z" level=info msg="20.0th: 19635, 19683"
time="2019-05-24T08:54:52Z" level=info msg="30.0th: 20907, 20917"
time="2019-05-24T08:54:52Z" level=info msg="40.0th: 21201, 21208"
time="2019-05-24T08:54:52Z" level=info msg="50.0th: 21346, 21352"
time="2019-05-24T08:54:52Z" level=info msg="90.0th: 21466, 21471"

Method used to log the above

    log.Printf("Total: %d samples", quantiles.Count())
    for _, quantile := range calcQuantiles {
        percentile, err := stats.Percentile(samples, quantile*100)
        if err != nil {
            log.Errorf("failed to calc percentile - %s", err)
        } else {
            log.Infof("%.1fth: %.0f", quantile*100, quantiles.Query(quantile), percentile)
        }
    }
   }

Please note a big difference in low percentile numbers between the 3 print outs above. Even though one would expect some differences at say 0.1th percentile because 3 print outs above contained different sample numbers. Still there is unexpectedly large variability with only a limited sample number change. Also notice that percentile calculated with github.com/montanaflynn/stats does not display the same variability, which rules out the possibility that the samples added were somehow very different to the previous lot.

So overall I think this library calculates low percentiles incorrectly or the method is not really suitable for this.

I have the data that was used for calculating the above and happy to share it. Just let me know if you want it. (it's less than 1Mb file)

frioux commented 5 years ago

The whole magic of this algorithm is that instead of calculating exact percentiles it adds an error (called epsilon) to allow more efficient recording of the summaries. The default in this library is 0.01, so a requested percentile of 0.01 is actually somewhere between zero and 0.02. You can request an epsilon of zero but I suspect the cost for that would be too high eventually.