Replace Datasketch HLL with newer Datasketch CPC sketch

apache / druid

Apache Druid: a high performance real-time analytics database.

https://druid.apache.org/

Apache License 2.0

13.3k stars 3.66k forks source link

Replace Datasketch HLL with newer Datasketch CPC sketch #7269

Open pdeva opened 5 years ago

pdeva commented 5 years ago

Motivation

the datasketches library has a new Unique Counting Sketch called CPC sketch that has better accuracy per size than HLL.

https://github.com/DataSketches/sketches-core/releases

Proposed changes

replace Datasketch HLL sketch with CPC or offer it alongside as a higher accuracy sketch

Rationale

Better accuracy

Operational impact

The docs of datasketches don't describe the whether CPC algorithm is more CPU intensive or not. This will determine whether we want to completely deprecate DataSketches HLL sketch and replace it with CPC or keep CPC as an additional option.

gianm commented 5 years ago

We would need to look into whether CPC has an offheap implementation yet. And I think we'd rather add than replace. And I don't want to say anything further until someone expert in datasketches chimes in :)

AlexanderSaydakov commented 5 years ago

There is no off-heap mode in CPC, and we doubt it would even make sense for this algorithm. Is it possible to not requite having a buffer aggregator?

pdeva commented 5 years ago

@AlexanderSaydakov is there any documentation on the algorithm? i couldnt find it on the datasketch website. is it more cpu intensive than HLL?

AlexanderSaydakov commented 5 years ago

We are working on the documentation. It is a new algorithm that beats HLL in terms of accuracy per byte (more compact (when serialized) with the same accuracy or more accurate with the same size or both). It might be a bit slower, especially merging. So it is not automatically better for all use cases. In particular, there is no fixed-size mode at all like HLL_6 and HLL_8 (HLL_4 has no hard limit).

djiangc commented 3 years ago

Any progress on this front?

AlexanderSaydakov commented 3 years ago

If by "this front" you mean CPC sketch aggregator in Druid, then a simple answer is "no". The requirement of operating in a fixed chunk of memory that is preallocated by a BufferAggregator is quite limiting. Some sketch algorithms are more friendly to this mode of operation, some other algorithms have to be coerced quite a bit, and can sometimes exceed a given limit, which leads to moving a particular sketch out of a given off-heap slot. This happens with quantiles sketch, for example. This mode of operation can change all sorts of trade-offs in an algorithm and make it essentially a different algorithm. We are not sure how to approach this with CPC sketch. There is still no off-heap implementation of it.

gianm commented 3 years ago

@AlexanderSaydakov If we had a mechanism in Druid for allocating additional memory at runtime, would there be anything else standing in the way of doing a CPC sketch in Druid? It's something that I think we'll be adding at some point. I can't predict exactly when, but, there's a variety of reasons why we'd want to do so, and they seem compelling enough that it's likely to happen.

AlexanderSaydakov commented 3 years ago

I am not sure about additional memory. Perhaps it would make sense to let aggregators to allocate their own memory to begin with, and not necessarily single contiguous chunks per aggregation.