LiveRamp / HyperMinHash-java

Union, intersection, and set cardinality in loglog space
Other
54 stars 13 forks source link

support custom Q or R or P of BetaMinhash #14

Open yuhongye opened 4 years ago

yuhongye commented 4 years ago

Is there any plan to support custom Q or R or P of BetaMinhash?

sherifnada commented 4 years ago

Hi @yuhongye

There are currently no plans to support custom Q/R/P values for BetaMinhash. For that usecase, we recommend using HyperMinHash. Would you mind sharing a little bit about any usecases you have for which you'd like this feature?

Best, Shrif

yuhongye commented 4 years ago

We want to take the intersection of multiple sets and be as accurate as possible, so we need a larger p. I tested the accuracy of HyperMinHash and BetaMinHash in calculating the jaccard index, and found that BetaMinHash is better in the case p=14, q=6, r=10. In my knowledge, the accuracy of these two algorithms should be the same, so I went to the source code to find the difference between them, and found four different implementations:

  1. calculate the position of the left-most 1-bit, BetaMinhash code is:

    private static short getLeftmostOneBitPosition(boolean[] bits, int p, int q) {
    int _2toTheQ = (1 << q);
    
    // Notice: I think offset should be p, not p + 1, becaulse index start at 0
    int offset = p + 1; 
    for (int i = offset; i < _2toTheQ + offset; i++) {
      if (bits[i]) {
        return (short) (i + 1 - offset);
      }
    }
    return (short) (_2toTheQ + 1);
    }

    and hyperminhash is

   final long zeroSearchSpace = (hllHash << p) | (long) (1 << (p - 1));
   final int leftmostOnePosition = Long.numberOfLeadingZeros(zeroSearchSpace) + 1;

I think BetaMinhash has a bug, I marked it in the comments of the code.

  1. calculate the r bits, BetaMinhash get the rightmost r bit of hash[1], and HyperMinHash get the leftmost r bits of hash[1]

  2. when calculate c in the algorithm 2.1.4 in the paper, BetaMinhash compare the whole register, but HyperMinhash only compare the mantissa.

    
    // BetaMinhash
    for (BetaMinHash sketch : sketches) {
    itemInIntersection = itemInIntersection &&
      firstSketch.registers[i] == sketch.registers[i];
    }

// HyperMinhash for (HyperMinHash sketch : sketches) { itemInIntersection = itemInIntersection && firstSketch.registers.getMantissaAtRegister(i) == sketch.registers.getMantissaAtRegister(i); }



4. BetaMinhash use the expectedCollision algorithm in the paper when calculate similarity, but HyperMinHash does not.

I want to know the impact of these differences, Thank you! (Hope you can understand my english expression.)