Open pdeva opened 5 years ago
We would need to look into whether CPC has an offheap implementation yet. And I think we'd rather add than replace. And I don't want to say anything further until someone expert in datasketches chimes in :)
There is no off-heap mode in CPC, and we doubt it would even make sense for this algorithm. Is it possible to not requite having a buffer aggregator?
@AlexanderSaydakov is there any documentation on the algorithm? i couldnt find it on the datasketch website. is it more cpu intensive than HLL?
We are working on the documentation. It is a new algorithm that beats HLL in terms of accuracy per byte (more compact (when serialized) with the same accuracy or more accurate with the same size or both). It might be a bit slower, especially merging. So it is not automatically better for all use cases. In particular, there is no fixed-size mode at all like HLL_6 and HLL_8 (HLL_4 has no hard limit).
Any progress on this front?
If by "this front" you mean CPC sketch aggregator in Druid, then a simple answer is "no". The requirement of operating in a fixed chunk of memory that is preallocated by a BufferAggregator is quite limiting. Some sketch algorithms are more friendly to this mode of operation, some other algorithms have to be coerced quite a bit, and can sometimes exceed a given limit, which leads to moving a particular sketch out of a given off-heap slot. This happens with quantiles sketch, for example. This mode of operation can change all sorts of trade-offs in an algorithm and make it essentially a different algorithm. We are not sure how to approach this with CPC sketch. There is still no off-heap implementation of it.
@AlexanderSaydakov If we had a mechanism in Druid for allocating additional memory at runtime, would there be anything else standing in the way of doing a CPC sketch in Druid? It's something that I think we'll be adding at some point. I can't predict exactly when, but, there's a variety of reasons why we'd want to do so, and they seem compelling enough that it's likely to happen.
I am not sure about additional memory. Perhaps it would make sense to let aggregators to allocate their own memory to begin with, and not necessarily single contiguous chunks per aggregation.
Motivation
the datasketches library has a new Unique Counting Sketch called CPC sketch that has better accuracy per size than HLL.
https://github.com/DataSketches/sketches-core/releases
Proposed changes
replace Datasketch HLL sketch with CPC or offer it alongside as a higher accuracy sketch
Rationale
Better accuracy
Operational impact
The docs of datasketches don't describe the whether CPC algorithm is more CPU intensive or not. This will determine whether we want to completely deprecate DataSketches HLL sketch and replace it with CPC or keep CPC as an additional option.