ascv / HyperLogLog

Fast HyperLogLog for Python.
MIT License
99 stars 19 forks source link

Add intersection and similarity #30

Open erp12 opened 5 years ago

erp12 commented 5 years ago

First of all, this library is great. Thank you!

I wanted to suggest two features found in some other HLL libraries that (as far as I can tell) are missing here. It is possible to compute an estimated cardinality of the intersection between two HLLs and an estimated similarity between two HLLs.

Here is a Go library that focuses on these two features: https://github.com/axiomhq/hyperminhash

It would be awesome to have C implementations of these operations exposed to python as part of the HyperLogLog object.

hll1.intersection(hll2)
hll1.similarity(hll2)
ascv commented 5 years ago

These features seem fine though I will need to more research on implementations for computing intersection cardinality.

kuk commented 2 years ago

Seems like there is no need for special implementations for computing intersection cardinality

cardinality(A | B) = cardinality(A) + cardinality(B) - cardinality(A & B)
-> cardinality(A & B) = cardinality(A) + cardinality(B) - cardinality(A | B)

For cardinality(A | B) you have already implemented .merge method.

ascv commented 2 years ago

@kuk

HyperLogLogs do not store the set, which is why they are memory efficient, so you cannot compute expressions like cardinality(A ∪ B) or cardinality(A ∩ B) with the data available.

Substituting the registers of the HyperLogLog for these expressions (e.g. cardinality(A ∪ B) --> cardinality(registers(A) ∪ registers(B))) doesn't work because the registers only store the maximum leading zero count observed. There is no way to know what values were previously observed which would be needed to compute an intersection or union of the sets.

For example:

Let H = a HyperLogLog 
Let A = {1, 2, ..., 2^32}
Let R = registers of H

Suppose k=4. Then max(cardinality(R)) = 2^4 since there are 2^4 registers which can at most store 2^4 distinct values. However cardinality(A) = 2^32 so it follows that cardinality(A) > max(cardinality(R)) and therefore cardinality(A) != cardinality(R).

kuk commented 2 years ago

... cannot compute expressions like cardinality(A ∪ B) ...

Wait, then what .merge is for? I thought is it for computing cardinality(A ∪ B)

ascv commented 2 years ago

Yes but this isn't done using a union operation. Instead there is a register by register comparison and the maximum value is the winner. Then the cardinality estimation of the merged HyperLogLog approximates the true cardinality of the union of the sets.

ascv commented 2 years ago

I should add that because you take the maximum value estimating the cardinality of a union is straightforward since the algorithm only needs the maximum values. However for intersections the maximum value doesn't give you sufficient information to accurately and reliably estimate the intersection cardinality. Of course, this is an area of active research so that could change in the future.

kuk commented 2 years ago

Knowing cardinality of sets union you can find cardinality of sets intersection. There is a formula in math:

cardinality(A ∩ B) = cardinality(A) + cardinality(B) - cardinality(A ∪ B)

For cardinality(A ∪ B) we have .merge, for cardinality(X) we have .cardinality.

ascv commented 2 years ago

And what happens when each estimate is 2% off in the same direction? The total error is then approximately 6% which is 3 times the error you normally get. Now extending this example, what happens when you use this approach on 1000 sets? Even if the error is low it is easy to see how the error quickly compounds to produce inaccurate results.

This was why my original preference was to "wait and see" and let people implement intersection estimates themselves. In the case of two sets, using inclusion-exclusion principle, it is trivial to do:

card_a = a.cardinality()
card_b = b.cardinality()
a.merge(b)
card_ab = card_a + card_b - a.cardinality() 

All this being said, you're not the first person to ask for an intersection feature. Maybe I should re-consider my original stance since "perfect is the enemy of good." If we used inclusion-exclusion, I would need to put a disclaimer in the README regarding the accuracy of this method.

kuk commented 2 years ago

And what happens when each estimate is 2% off in the same direction? The total error is then approximately 6% which is 3 times the error you normally get

Oh, yes good point, did not consider nonoptimal error estimate

jianshu93 commented 7 months ago

Hello All,

I think the answer is not to use that math equation above (also in the Dashing paper) for Jaccard index, which is know to have much larger variation than MinHash. the joint MLE estimator for cardinality in the Ertl 2017 paper is the most accurate, a well written paper with example code in c++, a C version can be derived from my perspective. FYI the Ultraloglog (https://arxiv.org/abs/2308.16862) is out with new FGRA estimator, even more space efficient while more accurate than the improved estimator and raw estimator, the closest one to the Camera-Rao lower bound (MLE method meets the lower bound but very slow, see paper for details). Should be very useful if we have a C-python version to use. The best for set similarity is MinHash (more recent is the optimal densification MinHash), the much smaller variance than whatever HLL based Jaccard, limited by the sqrt(1.079/m) RMSE bound.

Thanks,

Jianshu

ascv commented 3 months ago

@jianshu93 Thanks for bringing to my attention this paper. I gave it a quick skim and it looks promising. I think it would be best implemented as a separate class (e.g. UltraLogLog). This project is basically a side hobby of mine that I work on very infrequently so I can't make any promises. The current plan is to fix #46 and then I'll look into the UltraLogLog implementation.