Open yuhongye opened 4 years ago
Hi @yuhongye
There are currently no plans to support custom Q/R/P values for BetaMinhash. For that usecase, we recommend using HyperMinHash. Would you mind sharing a little bit about any usecases you have for which you'd like this feature?
Best, Shrif
We want to take the intersection of multiple sets and be as accurate as possible, so we need a larger p. I tested the accuracy of HyperMinHash and BetaMinHash in calculating the jaccard index, and found that BetaMinHash is better in the case p=14, q=6, r=10. In my knowledge, the accuracy of these two algorithms should be the same, so I went to the source code to find the difference between them, and found four different implementations:
calculate the position of the left-most 1-bit, BetaMinhash code is:
private static short getLeftmostOneBitPosition(boolean[] bits, int p, int q) {
int _2toTheQ = (1 << q);
// Notice: I think offset should be p, not p + 1, becaulse index start at 0
int offset = p + 1;
for (int i = offset; i < _2toTheQ + offset; i++) {
if (bits[i]) {
return (short) (i + 1 - offset);
}
}
return (short) (_2toTheQ + 1);
}
and hyperminhash is
final long zeroSearchSpace = (hllHash << p) | (long) (1 << (p - 1));
final int leftmostOnePosition = Long.numberOfLeadingZeros(zeroSearchSpace) + 1;
I think BetaMinhash has a bug, I marked it in the comments of the code.
calculate the r bits, BetaMinhash get the rightmost r bit of hash[1], and HyperMinHash get the leftmost r bits of hash[1]
when calculate c in the algorithm 2.1.4 in the paper, BetaMinhash compare the whole register, but HyperMinhash only compare the mantissa.
// BetaMinhash
for (BetaMinHash sketch : sketches) {
itemInIntersection = itemInIntersection &&
firstSketch.registers[i] == sketch.registers[i];
}
// HyperMinhash for (HyperMinHash sketch : sketches) { itemInIntersection = itemInIntersection && firstSketch.registers.getMantissaAtRegister(i) == sketch.registers.getMantissaAtRegister(i); }
4. BetaMinhash use the expectedCollision algorithm in the paper when calculate similarity, but HyperMinHash does not.
I want to know the impact of these differences, Thank you! (Hope you can understand my english expression.)
Is there any plan to support custom Q or R or P of BetaMinhash?