support custom Q or R or P of BetaMinhash

yuhongye commented 4 years ago

Is there any plan to support custom Q or R or P of BetaMinhash?

sherifnada commented 4 years ago

Hi @yuhongye

There are currently no plans to support custom Q/R/P values for BetaMinhash. For that usecase, we recommend using HyperMinHash. Would you mind sharing a little bit about any usecases you have for which you'd like this feature?

Best, Shrif

yuhongye commented 4 years ago

We want to take the intersection of multiple sets and be as accurate as possible, so we need a larger p. I tested the accuracy of HyperMinHash and BetaMinHash in calculating the jaccard index, and found that BetaMinHash is better in the case p=14, q=6, r=10. In my knowledge, the accuracy of these two algorithms should be the same, so I went to the source code to find the difference between them, and found four different implementations：

calculate the position of the left-most 1-bit, BetaMinhash code is:

private static short getLeftmostOneBitPosition(boolean[] bits, int p, int q) {
int _2toTheQ = (1 << q);

// Notice: I think offset should be p, not p + 1, becaulse index start at 0
int offset = p + 1; 
for (int i = offset; i < _2toTheQ + offset; i++) {
  if (bits[i]) {
    return (short) (i + 1 - offset);
  }
}
return (short) (_2toTheQ + 1);
}

and hyperminhash is

   final long zeroSearchSpace = (hllHash << p) | (long) (1 << (p - 1));
   final int leftmostOnePosition = Long.numberOfLeadingZeros(zeroSearchSpace) + 1;

I think BetaMinhash has a bug, I marked it in the comments of the code.

calculate the r bits, BetaMinhash get the rightmost r bit of hash[1], and HyperMinHash get the leftmost r bits of hash[1]

when calculate c in the algorithm 2.1.4 in the paper, BetaMinhash compare the whole register, but HyperMinhash only compare the mantissa.


// BetaMinhash
for (BetaMinHash sketch : sketches) {
itemInIntersection = itemInIntersection &&
  firstSketch.registers[i] == sketch.registers[i];
}

// HyperMinhash for (HyperMinHash sketch : sketches) { itemInIntersection = itemInIntersection && firstSketch.registers.getMantissaAtRegister(i) == sketch.registers.getMantissaAtRegister(i); }



4. BetaMinhash use the expectedCollision algorithm in the paper when calculate similarity, but HyperMinHash does not.

I want to know the impact of these differences， Thank you! (Hope you can understand my english expression.)

LiveRamp / HyperMinHash-java

support custom Q or R or P of BetaMinhash #14