apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0
894 stars 209 forks source link

Update sketches will have large error rate in some cases. #318

Closed yunfan123 closed 4 years ago

yunfan123 commented 4 years ago

All of update sketch don't deal with duplicate datas. So in such situation:

val sketch1 = UpdateSketch()
sketch1.update(1)
sketch1.update(1)
sketch2.update(xxx)
val sketch2 = UpdateSketch()
sketch2.update(1)
sketch2.update(xxx)
anotb.update(sketch1, sketch2)
anotb.getResult()

Int this cases, because the number 1 add two times and only remove once, so it should counts in the final result. But actually it will not. It will cause very large error rate in small data.

jmalkin commented 4 years ago

This is not correct. Distinct counting sketches will count a value only once. You added it and then removed it, so the value will not be present after the AnotB operation.

jmalkin commented 4 years ago

Without additional context, these seem to be theta sketches. Those will count each item once, which is why they're distinct counting sketches or cardinality estimation sketches: They tell you how many different elements exist, but nothing about how often any specific element exists. There something of an overview here: http://datasketches.apache.org/docs/Architecture/MajorSketchFamilies.html

There is a Tuple sketch that can hold additional data associated with each element, but there's no default behavior for how those combine. You'd need to define your own summaries to try to create the behavior you want. Keeping in mind that these are probabilistic structure and that you can't expect exact results -- not even for individual elements -- as the data volume grows.