Open patelprateek opened 1 year ago
Here is the doc for the function: https://docs.pinot.apache.org/configuration-reference/functions/distinctcountthetasketch You may also learn more about theta sketch here: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html
There is one parameter that can be passed to the function: nominalEntries. By default it is set to 4096; you may try a higher value to get better accuracy, at the cost of worse performance.
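As a rough guide to that trade-off, here is a small sketch that uses the commonly quoted approximation that a theta sketch in estimation mode has a relative standard error of about 1/sqrt(k), where k is nominalEntries. The helper name approx_rse is made up for illustration and is not part of Pinot or DataSketches, and the formula is only an approximation for a single sketch, not for intersections.

```python
import math


def approx_rse(nominal_entries: int) -> float:
    """Rough relative standard error of a single-sketch distinct count.

    Assumption: uses the 1/sqrt(k) rule of thumb, which only applies once
    the number of distinct values exceeds nominal_entries.
    """
    return 1.0 / math.sqrt(nominal_entries)


# Compare the default against a few larger settings.
for k in (4096, 16384, 65536):
    print(f"nominalEntries={k:>6}  approx RSE={approx_rse(k):.4%}")
```

Quadrupling nominalEntries only halves the expected error, while the sketch has to retain roughly four times as many hashes, which is where the performance cost comes from.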
Maybe my question wasn't clear. I understand what theta sketches are, but I am trying to understand how you build auto-sharding for high-cardinality segments when constructing a theta sketch. What is considered high cardinality, and what are the thresholds? IIUC, intersection(theta_sketch(a), theta_sketch(b)) can have a high error rate when the Jaccard similarity is low or the difference between the cardinalities of sets A and B is large, so you also shard the bigger set into smaller pieces. I am trying to better understand how this sharding is implemented.
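To make the intersection concern concrete, here is a toy theta-style (k-minimum-values) sketch in pure Python. It is a simplified illustration and not the DataSketches or Pinot implementation: ToyThetaSketch, _hash01, and the set sizes are all hypothetical. It shows that when a large set is intersected with a small one, the result inherits the smaller theta of the large set, so very few hashes survive to estimate the overlap.

```python
import hashlib


def _hash01(item) -> float:
    """Map an item deterministically to a float in [0, 1)."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") >> 11) / 2**53


class ToyThetaSketch:
    """Toy KMV-style sketch: keeps the hashes strictly below theta."""

    def __init__(self, theta: float, hashes: set):
        self.theta = theta
        self.hashes = hashes

    @classmethod
    def build(cls, items, k: int) -> "ToyThetaSketch":
        hashes = sorted({_hash01(x) for x in items})
        if len(hashes) < k:
            # Exact mode: fewer than k distinct values, nothing is discarded.
            return cls(1.0, set(hashes))
        # Estimation mode: theta is the k-th smallest hash; keep the k-1 below it.
        return cls(hashes[k - 1], set(hashes[: k - 1]))

    def estimate(self) -> float:
        return len(self.hashes) / self.theta

    def intersect(self, other: "ToyThetaSketch") -> "ToyThetaSketch":
        # The result can only be as fine-grained as the smaller theta.
        theta = min(self.theta, other.theta)
        common = {h for h in self.hashes & other.hashes if h < theta}
        return ToyThetaSketch(theta, common)


# Hypothetical sizes: a large set A and a small set B sharing 1,000 items.
k = 1024
a = ToyThetaSketch.build(range(100_000), k)
b = ToyThetaSketch.build(range(99_000, 101_000), k)
both = a.intersect(b)
print("retained hashes in intersection:", len(both.hashes))
print("intersection estimate (true value 1000):", both.estimate())
```

In this setup theta for A is around k/100000, so only on the order of ten of the 1,000 shared items are expected to survive into the intersection, and the estimate rests on that handful of samples. Splitting the larger set into smaller shards raises each shard's theta, which is the intuition behind the sharding question.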
The implementation of this support is in the DistinctCountThetaSketchAggregationFunction class.
With the current implementation, we don't shard the set. I think this can be a good optimization, and we need some research to decide a cardinality threshold to shard the set. We can also consider providing this threshold as a parameter to the function.
Do you want to help contribute this feature?
I was going through the PR: https://github.com/apache/pinot/pull/5316 Can you please point me to how or where this is implemented? How do we define the high-cardinality threshold?
I am running into issues where different sets have very different cardinalities and the error is high, and I wanted insights on how to tune the theta parameters during my indexing phase. What is a reasonable threshold for deciding that a set is high cardinality?