apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0
878 stars 206 forks source link

Interface differences #184

Closed tlindener closed 6 years ago

tlindener commented 6 years ago

Hi, first thanks a lot for this amazing library! Really appreciated.

I'm currently wondering about the interface of the HllSketch. Is there a design choice in using Generics in the ItemsSketch but not in the HllSketch? Similarly, the ThetaSketch uses a Builder Pattern while others do not. Would be great to know about these design choices!

leerho commented 6 years ago

The DataSketches library has multiple families of sketches. Builders simplify combinatorial explosion of construction options and are only useful when there are a lot of options for a given class. When a class has only one or two options, a builder is not necessary and would just add clutter to the API.

For Unique Count Estimation: The ThetaSketch family is big family with multiple family members and options: QuickSelect, Alpha, Compact, Union, Direct vs Heap, etc. Because of all the options, a builder tool is really necessary.

The HLL and HLL MAP families are very simple with only one or two public classes and limited options so a builder isn't really needed.

The Unique counting sketches accept as input all the primitives and strings, but not objects. If you wish to count unique complex objects, you will need to provide your own GUID, digest or hash of your object that has at least 64 bits of entropy. Note that the Java hashCode() is only 32 bits and by the Birthday Paradox, there is a 50% chance of a duplicate of about one in 2^16, which in very large data sets could be a significant source of estimation error. Because of this we do not provide a generic version of the unique count sketches.

For Frequency and Heavy Hitter Estimations The Frequency family of sketches returns a list of items (defined by the user) that constitute the Heavy Hitters of the stream. Internally, the ItemsSketch sketch keep a "sample" of the items it has seen. Thus, by definition, to be really useful it needs to be generic. There is also a streamlined version of this sketch for frequency estimation of long values. The frequency sketches have only one input parameter so no builder is required.

For Estimating Distributions The Quantiles family allows you to extract the PMF, CDF, and rank distributions of items (defined by the user) from a stream. The only property that is required of these items is that they are comparable. A getQuantiles(...) query returns an ordered list of items from the stream. Again, to be really useful, this sketch needs to be generic. A streamlined doubles version of this sketch is also available. The ItemsSketch only requires two parameters. Although there are builders for the DoublesSketch and the DoublesUnion they are not really required.

For Uniform and Weighted Sampling These sketches must be generic by definition. They have few construction options thus no builders are necessary.

Associative Unique Counting The Tuple family is quite complex and has multiple generic classes as well as builders.

Other Specialized Applications Sketches in the other repositories (Android, Pig, Hive, and Vector) may have builders and or generic forms as a function of their intended application and complexity.

tlindener commented 6 years ago

Thanks a lot! Especially the part about the hash function is good to know!