ascv / HyperLogLog

Fast HyperLogLog for Python.
MIT License
101 stars 19 forks source link

Incorrect estimation of large cardinalities #50

Open tatrats opened 1 day ago

tatrats commented 1 day ago

First, thanks for this great library ! However, I noticed a typo that causes very large cardinalities to be underestimated.

Amplitude of the error (using $2^{20}$ registers): The relative error is around $10^{-6}$ for a count of $2^{35}$, but it grows with larger cardinalities: counting $2^{40}$ elements underestimates by around $2^{-10}$, and for $2^{45}$ the error is large, around 10%.

The culprit: function clz Function clz has a typo that causes it to return 24 when it should return 25 or 26. Here is the typo at line 1151 of src/hll.c.

         if (x >= (1UL << 48UL)) {
             shift = (x >= (1UL << 56UL)) ? 56 : 48;
         } else {
-            shift = (x >= (1UL << 38UL)) ? 40 : 32;
+            shift = (x >= (1UL << 40UL)) ? 40 : 32; // Comparison must be with 1 << 40 instead of 1 << 38
         }
     } else {
ascv commented 1 day ago

Thank you for identifying this issue. Fix is on #51 .