auto sharding strategy for theta sketch

apache / pinot

Apache Pinot - A realtime distributed OLAP datastore

https://pinot.apache.org/

Apache License 2.0


auto sharding strategy for theta sketch #9437

Open patelprateek opened 1 year ago

patelprateek commented 1 year ago

I was going through the PR https://github.com/apache/pinot/pull/5316. Can you please point me to how or where this is implemented? How do we define the high-cardinality threshold?

I am running into issues where different sets have very different cardinalities and the error is high, and I wanted insights on how to tune the theta parameters during my indexing phase. What is a reasonable threshold for deciding that a set is high cardinality?

Jackie-Jiang commented 1 year ago

Here is the doc for the function: https://docs.pinot.apache.org/configuration-reference/functions/distinctcountthetasketch You may also learn more about theta sketch here: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

There is one parameter that can be passed to the function: nominalEntries. By default it is set to 4096, and you may try a higher value to get better accuracy, at the cost of performance.
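As a rough guide to that trade-off, the relative standard error of a single theta sketch estimate is on the order of 1/sqrt(k), where k is nominalEntries. The short Python sketch below only evaluates that approximation. The function name `approx_rse` is hypothetical, and the figure applies to a single sketch in estimation mode, not to the result of an intersection, which can be much worse.

```python
import math


def approx_rse(nominal_entries: int) -> float:
    """Approximate relative standard error of one theta sketch estimate.

    Assumption: RSE is on the order of 1 / sqrt(k) for k nominal entries.
    """
    return 1.0 / math.sqrt(nominal_entries)


# Compare the default (4096) with two larger settings.
for k in (4096, 16384, 65536):
    print(f"nominalEntries={k}: approx RSE = {approx_rse(k):.4%}")
```

Quadrupling nominalEntries roughly halves the error, while the sketch grows about four times larger.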

patelprateek commented 1 year ago

Maybe my question wasn't clear. I understand what theta sketches are, but I am trying to understand how auto sharding is built for high-cardinality segments when constructing the theta sketch: what is considered high cardinality, and what are the thresholds? IIUC, intersection(theta_sketch(a), theta_sketch(b)) can have a high error rate when the Jaccard similarity is low or when the cardinalities of sets A and B differ greatly, so you also shard the bigger set to make each piece smaller. I am trying to better understand how this sharding is implemented.
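To make the effect described above concrete, here is a minimal, self-contained Python sketch. It is not Pinot's or DataSketches' implementation: the toy KMV-style sketch, the helper names (`h`, `sketch`, `estimate`, `intersect`, `shard_of`), the value of `K`, the shard count, and the set sizes are all assumptions chosen for illustration. It intersects a large set with a small subset, once directly and once after hash-partitioning the large set into shards and summing the per-shard intersection estimates.

```python
import hashlib

K = 512          # nominal entries per sketch (assumed; small so the demo is fast)
NUM_SHARDS = 16  # assumed shard count for the large set


def h(item, key=b"sketch"):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    d = hashlib.blake2b(str(item).encode(), key=key, digest_size=8).digest()
    return int.from_bytes(d, "big") / 2**64


def sketch(items, k=K):
    """Return (theta, retained): retained holds the hashes below theta."""
    hashes = sorted({h(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)        # exact mode: nothing discarded
    return hashes[k], set(hashes[:k])  # theta = (k+1)-th smallest hash


def estimate(sk):
    """Cardinality estimate: retained count scaled by 1 / theta."""
    theta, retained = sk
    return len(retained) / theta


def intersect(a, b):
    """Intersection sketch: the smaller theta wins, keep common hashes below it."""
    theta = min(a[0], b[0])
    return theta, {v for v in a[1] & b[1] if v < theta}


def shard_of(item, n):
    """Assign an item to a shard using a hash independent of the sketch hash."""
    d = hashlib.blake2b(str(item).encode(), key=b"shard", digest_size=8).digest()
    return int.from_bytes(d, "big") % n


big = range(200_000)            # large set A
small = range(0, 200_000, 500)  # small set B: 400 items, all inside A
sk_small = sketch(small)        # fits in K entries, so it stays exact

# Direct intersection: theta is dominated by the large set's sketch.
unsharded = estimate(intersect(sketch(big), sk_small))

# Sharded: partition A, intersect each shard with B, sum the disjoint results.
shards = [[] for _ in range(NUM_SHARDS)]
for x in big:
    shards[shard_of(x, NUM_SHARDS)].append(x)
sharded = sum(estimate(intersect(sketch(s), sk_small)) for s in shards)

print(f"true={len(small)} unsharded={unsharded:.0f} sharded={sharded:.0f}")
```

In the direct case, only about one of B's hashes is expected to fall below A's theta, so the estimate rests on a handful of samples. Each shard of A is smaller and therefore has a larger theta, so the summed estimate draws on more retained hashes in total. The shards are disjoint, which is what allows the per-shard intersection estimates to be added.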

Jackie-Jiang commented 1 year ago

The implementation for this support is in the DistinctCountThetaSketchAggregationFunction class. With the current implementation, we don't shard the set. I think this can be a good optimization, and we need some research to decide a cardinality threshold to shard the set. We can also consider providing this threshold as a parameter to the function. Do you want to help contribute this feature?