addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 556 forks source link

consider support for additional hll(p) register sizes (currently only 5 is used) #89

Open tea-dragon opened 9 years ago

tea-dragon commented 9 years ago

(this issue takes over for #88 after clarification)

There is a trade off between bit usage and ultra-high cardinality accuracy. The current HLLP implementation uses 5 bits per register similar to HLL. The benefit is that it is an easier comparison between the two, and more efficient for most use cases. However, iirc, the original paper does recommend going up to 6 bits regardless, and it is not terribly difficult to modify the register size to be a non-constant (other than serialization format concerns).

Additionally, although the lower cardinality space is fairly well covered by the sparse set representation, it is also possible that there may be benefit to allowing an even lower register size. This may work even better if some kind of additional, secondary dynamic switch is supported. eg. "SPARSE -> NORMAL_4 -> NORMAL_5 -> NORMAL_6" or something. The runtime performance may be tricky to do well in that case though.

The easiest solution is to add a config parameter and somehow deal with serialization issues.