Open tangyong opened 3 years ago
I'd like to revive this thread highlighting another advantage of using the DataSketches library which I came across recently.
Besides being well maintained, the library provides binary representation compatibility across implementations/releases, which makes it interoperable with external systems. An example usage might be a Spark pipeline generating intermediate data sketches before writing into Pinot. Pinot can serve queries with complex filters and eventually merge/intersect these sketches to produce estimates.
This can already be achieved in Pinot for cardinality estimation using ThetaSketch functions 👍 . I think a quantile implementation such as KLL and a Frequent Items sketch from this library would be great additions to complete the picture.
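To make the merge/intersect workflow above concrete, here is a toy K-minimum-values sketch in plain Python. It is only an illustration of the idea behind mergeable sketches. It is not the DataSketches API, its binary format, or Pinot code; `ToyKmvSketch`, `hash64`, and the byte layout are hypothetical names and choices made up for this example.

```python
import hashlib
import heapq
import struct

MAX_HASH = 2 ** 64  # size of the 64-bit hash space


def hash64(item: str) -> int:
    """Map an item to a 64-bit integer (hypothetical helper for this example)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


class ToyKmvSketch:
    """Keeps the k smallest distinct hashes seen; a toy, not the DataSketches format."""

    def __init__(self, k=1024, hashes=()):
        self.k = k
        # nsmallest returns the k smallest distinct hashes in ascending order
        self.hashes = heapq.nsmallest(k, set(hashes))

    @classmethod
    def from_items(cls, items, k=1024):
        return cls(k, (hash64(x) for x in items))

    def theta_hash(self) -> int:
        # Largest retained hash once the sketch is full, else the whole hash space
        return self.hashes[-1] if len(self.hashes) >= self.k else MAX_HASH

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # exact while under capacity
        return (self.k - 1) / (self.theta_hash() / MAX_HASH)

    def union(self, other):
        # The k smallest hashes of both sketches form the sketch of the combined data
        return ToyKmvSketch(min(self.k, other.k), set(self.hashes) | set(other.hashes))

    def intersection_estimate(self, other) -> float:
        # Count hashes present in both sketches below the smaller threshold,
        # then scale by the fraction of the hash space that threshold covers
        limit = min(self.theta_hash(), other.theta_hash())
        common = sum(1 for h in set(self.hashes) & set(other.hashes) if h <= limit)
        return common / (limit / MAX_HASH)

    def to_bytes(self) -> bytes:
        # Made-up layout: k, count, then the hashes as little-endian uint64
        return struct.pack(f"<II{len(self.hashes)}Q", self.k, len(self.hashes), *self.hashes)

    @classmethod
    def from_bytes(cls, data: bytes):
        k, n = struct.unpack_from("<II", data)
        return cls(k, struct.unpack_from(f"<{n}Q", data, 8))


# "Upstream job": build one sketch per partition and serialize it
part_a = ToyKmvSketch.from_items(f"user-{i}" for i in range(0, 60000)).to_bytes()
part_b = ToyKmvSketch.from_items(f"user-{i}" for i in range(40000, 100000)).to_bytes()

# "Query side": deserialize, then merge or intersect without the raw data
a, b = ToyKmvSketch.from_bytes(part_a), ToyKmvSketch.from_bytes(part_b)
print("union estimate:", round(a.union(b).estimate()))
print("intersection estimate:", round(a.intersection_estimate(b)))
```

The point of the example is that the query side never needs the raw rows, only the serialized sketches, which is why a stable binary representation across implementations matters for a Spark-to-Pinot pipeline.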
I'd be happy to give it a go if we can get a consensus on including them. @mayankshriv please let me know what you think.
cc @chenboat
Yes, there is interest from other community members in supporting other sketches from the DataSketches library. These can be added as separate aggregation functions (e.g. distinctCountDataSketch, quantileDataSketch, etc.) and should be straightforward to add.
Created a proposal document to add some of these sketches: https://docs.google.com/document/d/1ctmKVRi67lpO6x1RYKDvDYf05EZx2Vbs2OnUudYP-bU/edit
I've added support for the CPC sketch in this PR which is still being tested: https://github.com/apache/pinot/pull/11774
The name of the query aggregation function was chosen based on @cbalci's proposal document above.
The following is the discussion with Mayank on slack:
Mark: Hi Team, I have seen that in 0.4.0, Pinot implemented an initial version of a theta-sketch based distinct count aggregation function, utilizing the Apache DataSketches library. The latest Druid release also includes a DataSketches extension (Theta sketch, Tuple sketch, Quantiles sketch, HLL sketch). Does Pinot have any plan to implement sketches other than the Theta sketch? Thanks.
Mayank: Pinot already supports HLL and TDigest based percentiles. If there's a specific case where you would find DataSketch based implementations more useful, we can definitely explore that. If so, would recommend filing an issue for that.
Mayank: For HLL we use com.clearspring.analytics.stream.cardinality.HyperLogLog, and for TDigest we use com.tdunning.math.stats.TDigest.
Mark: We may need to pay attention to the KLL sketch vs. t-digest (the Pinot implementation); see the following comparison by DataSketches: https://datasketches.apache.org/docs/Quantiles/KllSketchVsTDigest.html
Mayank: Thanks for sharing @Mark.Tang. We can definitely explore adding these if needed.
Mark: appendix(https://github.com/apache/datasketches-website/blob/master/docs/pdf/DataSketches_deck.pdf): HLL
Also noting that DataSketches includes a newer CPC sketch ("Estimating Stream Cardinalities more efficiently than the famous HLL sketch"), which is based on https://arxiv.org/pdf/1708.06839.pdf