dyf6372 closed this issue 1 year ago
The example code was Java, but the behavior is the same in this repo's C++ code and in the Java library, so we can address it here.
An HLL sketch moves through up to 3 different modes over its lifetime. We start in what we call a "coupon collection" phase, which itself has 2 steps. The first is to store a list of coupons (LIST mode). That fairly quickly turns into a hash set (SET mode). When the hash table would grow larger than the final HLL array, we transition into the final HLL mode.
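The mode progression described above can be sketched with a toy class. This is a simplified model, not the library's implementation: the class name `ToyModeSketch`, the list capacity of 8, the 4-bytes-per-coupon size rule, the coupon packing, and the SplitMix64-style hash are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the LIST -> SET -> HLL progression. Not the library's code. */
class ToyModeSketch {
    enum Mode { LIST, SET, HLL }

    // Hypothetical constants, chosen for illustration only.
    private static final int LIST_CAPACITY = 8;
    private static final int BYTES_PER_COUPON = 4;

    private final int lgK;
    private final int k;
    private final List<Integer> list = new ArrayList<>();
    private final Set<Integer> set = new HashSet<>();
    private byte[] registers; // allocated only on promotion to HLL mode
    private Mode mode = Mode.LIST;

    ToyModeSketch(int lgK) {
        this.lgK = lgK; // assumed to be at most 26 so a coupon fits in an int
        this.k = 1 << lgK;
    }

    Mode mode() { return mode; }

    // SplitMix64-style mixer standing in for the real hash function.
    private static long mix(long x) {
        long z = x + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // A coupon packs (register value, slot index) into one int.
    private int coupon(long item) {
        long h = mix(item);
        int slot = (int) (h & (k - 1));
        int value = Math.min(Long.numberOfLeadingZeros(h), 30) + 1;
        return (value << lgK) | slot;
    }

    void update(long item) {
        int c = coupon(item);
        switch (mode) {
            case LIST:
                if (!list.contains(c)) {
                    list.add(c);
                }
                if (list.size() > LIST_CAPACITY) {
                    // The short list has filled up: switch to a hash set.
                    set.addAll(list);
                    list.clear();
                    mode = Mode.SET;
                }
                break;
            case SET:
                set.add(c);
                // Promote once the coupons would occupy more bytes than the
                // final register array (one byte per register in this toy).
                if ((long) set.size() * BYTES_PER_COUPON > k) {
                    promoteToHll();
                }
                break;
            default:
                applyCoupon(c);
        }
    }

    private void promoteToHll() {
        registers = new byte[k];
        mode = Mode.HLL;
        for (int c : set) {
            applyCoupon(c);
        }
        set.clear();
    }

    // Each register keeps the maximum value seen for its slot.
    private void applyCoupon(int c) {
        int slot = c & (k - 1);
        int value = c >>> lgK;
        if (value > registers[slot]) {
            registers[slot] = (byte) value;
        }
    }

    public static void main(String[] args) {
        ToyModeSketch sk = new ToyModeSketch(10);
        for (long i = 0; i < 10_000; i++) {
            Mode before = sk.mode();
            sk.update(i);
            if (sk.mode() != before) {
                System.out.println("after " + (i + 1) + " items: " + before + " -> " + sk.mode());
            }
        }
    }
}
```

Running `main` prints the item counts at which this toy changes mode. The real library's thresholds and coupon layout differ, so only the shape of the progression carries over.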
Additionally, there are several estimators. The basic one is the composite estimator, which is itself a mix of several models and is order-independent. We also have the HIP (Historical Inverse Probability) estimator, which provides more accurate estimates at the cost of being order-dependent. The HIP estimator works as long as the inputs are presented in order, and we can be a little more general: it remains valid as long as we are processing raw inputs or coupons.
Because the HIP estimator stores a little extra information, we can actually use that even when the sketch has transitioned into HLL mode as long as we have only fed in raw inputs or coupons. The ability to use the HIP estimator remains until we union two sketches already in HLL mode.
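To make the order dependence concrete, here is a minimal toy HLL that keeps a HIP-style running accumulator next to its registers. It is a sketch under stated assumptions, not the library's code: `ToyHll`, its SplitMix64-style hash, and the textbook raw HLL formula (used here as a stand-in for an order-independent estimator; it is not the library's composite estimator) are all chosen for illustration.

```java
import java.util.Arrays;

/** Toy HLL with a HIP-style accumulator. Not the library's code. */
class ToyHll {
    final int k;
    final byte[] registers;
    private double kxq;      // running sum over registers of 2^(-register)
    private double hipAccum; // HIP-style running estimate

    ToyHll(int lgK) {
        k = 1 << lgK;        // the alpha constant below assumes k >= 128
        registers = new byte[k];
        kxq = k;             // all registers start at 0, and 2^0 = 1
    }

    // SplitMix64-style mixer standing in for the real hash function.
    private static long mix(long x) {
        long z = x + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void update(long item) {
        long h = mix(item);
        int slot = (int) (h & (k - 1));
        int value = Math.min(Long.numberOfLeadingZeros(h), 62) + 1;
        int old = registers[slot];
        if (value > old) {
            // Before this update, a fresh item changed the sketch with
            // probability kxq / k, so add the inverse of that probability.
            // The increment depends on the state at arrival time, which is
            // why this estimate depends on the order of the inputs.
            hipAccum += k / kxq;
            kxq += Math.scalb(1.0, -value) - Math.scalb(1.0, -old);
            registers[slot] = (byte) value;
        }
    }

    double hipEstimate() {
        return hipAccum;
    }

    // Textbook raw HLL estimate, computed from the final registers only,
    // so it cannot depend on the order in which items arrived.
    double rawEstimate() {
        double sum = 0.0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r);
        }
        double alpha = 0.7213 / (1.0 + 1.079 / k);
        return alpha * k * k / sum;
    }

    public static void main(String[] args) {
        int n = 10_000;
        ToyHll forward = new ToyHll(10);
        ToyHll reverse = new ToyHll(10);
        for (long i = 0; i < n; i++) {
            forward.update(i);
        }
        for (long i = n - 1; i >= 0; i--) {
            reverse.update(i);
        }
        System.out.println("same registers: " + Arrays.equals(forward.registers, reverse.registers));
        System.out.println("HIP forward=" + forward.hipEstimate() + " reverse=" + reverse.hipEstimate());
        System.out.println("raw forward=" + forward.rawEstimate() + " reverse=" + reverse.rawEstimate());
    }
}
```

Both orderings end with identical registers, so any estimate computed purely from the registers agrees exactly. The HIP accumulator is built from the history of updates, so the two orderings will generally give slightly different values.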
There are 3 ways to query a sketch's estimate:
When I looked at the set of input sketches, all of them were in SET mode. As a result, getEstimate() was using the order-dependent HIP estimator, which is why you saw varying results. Calling getCompositeEstimate() returned a stable result, but on average that will have slightly larger error. The choice of estimator here lets you make a trade-off between accuracy and order-independence.
This behavior is ultimately a feature, not a bug. Please feel free to ask if you have additional questions.
@jmalkin Thank you very much for your meticulous reply. The complete program is implemented in C++. After finding this behavior, I wrote a small demo in Java to reproduce it.
We constructed multiple HLL sketches and then merged them using a union object. We found that the result of getEstimate() is not stable if we shuffle the order in which the sketches are merged, but the result of getCompositeEstimate() is always consistent.
This is the demo program
And sample data
And output