Maximal estimate error - Githubissues

apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

https://datasketches.apache.org

Apache License 2.0

894 stars 209 forks source link

Maximal estimate error #328

Closed Cheappie closed 3 years ago

Cheappie commented 4 years ago

Hi, is there a formula for cpc or hll sketches to calculate max error for estimate ? I wonder what are the practices for assuring correctness(high accuracy) of execution while using sketches ?

Sketches, probablistic models for performing such operations as distinct counting looks great in both terms storage and performance, however crucial thing for me is whether there are hard boundaries for accuracy and error ?

jmalkin commented 4 years ago

Error is probabilistic, and for larger values is approximately Gaussian. Bounds estimates are provided in terms of 1, 2, and 3 standard deviations, with a decreasing probability of exceeding each of the bounds, respectively. That same trend persists indefinitely, but we don't quantify it.

AlexanderSaydakov commented 4 years ago

As Jon said, for these algorithms there are no hard bounds. There is a distribution of error similar to normal distribution with ever decreasing probability to have an estimate farther and farther away from the true value. The sketch can give you upper and lower bounds for three confidence intervals that roughly correspond to 1, 2 and 3 standard deviations of the normal distribution: roughly 67%, 95% and 99% confidence. There are formulas in the papers. Perhaps we need better documentation on the web site. We have an accuracy table for Theta sketch, but not for HLL and CPC yet.

Cheappie commented 4 years ago

Ok, now I got It. I was wondering whether I have missed certain information, or It is the way it works. Thank you guys. @AlexanderSaydakov that's a good point about documentation, for sure the other two sketches could benefit from having documentation as good as theta sketch.