Hi Jianshu,
the Shannon entropy of SetSketch is given by $$\frac{m}{\ln(2) \ln(b)} \left(\left(1-\frac{1}{b}\right)+\int_0^1 z^{1/(b-1)} \frac{(1-z) \ln(1-z)}{z \ln(z)} dz\right).$$ This is the number of bits required to store a SetSketch when using optimal compression.
For $b=2$ (which asymptotically corresponds to HyperLogLog) the Shannon entropy per register (divided by $m$) is 2.83 (see here). This is the reason why it is possible to reduce the space significantly compared to uncompressed HyperLogLog with 6 bits per register. Theoretically, a reduction by 53% could be achieved. HyperLogLogLog reports 40%.
For $b=1.001$ the Shannon entropy per register is 12.24 (see here). Compared to an uncompressed register size of 16 bits, there is not as much room for improvement as for HyperLogLog. Theoretically, a 23% reduction would be possible, which is probably difficult to achieve in practice and introduces additional complexity.
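If you want to double-check these numbers, here is a small Python sketch that evaluates the formula above numerically (assuming SciPy is available; this is just a sanity check, not production code):

```python
from math import log, log1p
from scipy.integrate import quad

def entropy_per_register(b):
    # Integrand of the formula above. The singularities at z=0 and z=1 are
    # integrable, and quad only evaluates interior points; log1p(-z) is used
    # for numerical accuracy near z=1.
    f = lambda z: z**(1.0/(b - 1.0)) * (1.0 - z)*log1p(-z) / (z*log(z))
    integral, _ = quad(f, 0.0, 1.0, limit=200)
    return ((1.0 - 1.0/b) + integral) / (log(2.0)*log(b))

print(entropy_per_register(2.0))    # ~2.83 bits (HyperLogLog case)
print(entropy_per_register(1.001))  # ~12.24 bits
```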
Otmar
Hello Otmar,
Thanks for the detailed entropy equation, which I was not able to find in the main SetSketch paper. It seems that for the b=1.001 case, used for set similarity comparison (the locality-sensitivity section), we do not have much room for improvement. We are processing billions of elements per set, for millions of sets. Can SetSketch also be applied to weighted elements, like those in BagMinHash? I am also wondering whether it can be applied to a Jp-like index as in ProbMinHash.
Thanks,
Jianshu
Hello Otmar,
I am comparing the SetSketch locality-sensitivity section with b-bit One Permutation MinHash with optimal densification for large sets. It seems that for very small Jaccard similarities, around 0.005 or smaller, the RMSE of SetSketch with default settings (b=1.001 and m=4096) is quite large, e.g. 3% or so; decreasing b to 1.0001 improves the RMSE, but only slowly (increasing m also helps). With densification and the same sketch size of 4096, it is very close to the theoretical lower bound J(1-J)/m = 0.00000121 = 0.000121%, which is much smaller (see also the last panel of figure 4 of the optimal densification paper). I care about small J like 0.005 because it is very important in biological applications, especially for estimating genomic distance. I understand that the densification scheme is only for MinHash and was seldom used for cardinality estimation. There are advantages to SetSketch, but I am wondering how to further reduce the variance for small Jaccard similarities without losing speed; the mentioned densification scheme is very fast. Another question: the densification idea is not mergeable, right? I just want to confirm that we cannot use it in a distributed environment.
Thanks,
Jianshu
Hi Jianshu,
sketching for very small Jaccard similarities is generally very difficult. Neither SetSketch nor b-bit minwise hashing is a good option for this. Conventional minwise hashing and faster alternatives like One Permutation Hashing, Fast Similarity Sketching, or SuperMinHash (see here for a Java implementation) are better suited for small Jaccard similarities. When comparing the RMSE, note that J(1-J)/m is the variance, so the RMSE is sqrt(J(1-J)/m). Btw, the SetSketch paper also describes a more accurate estimation approach for MinHash, which also involves the estimated distinct counts.
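For your example with J = 0.005 and m = 4096 this gives $$\sqrt{J(1-J)/m} = \sqrt{0.005 \cdot 0.995 / 4096} \approx 0.0011,$$ i.e. an RMSE of about 0.11%, not 0.000121%.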
Otmar
Hello Otmar,
Thanks for the correction and explanation. This is helpful. I think for now, the best solution for us is to use the SetSketch locality sensitivity first (combined with an extremely fast graph-based nearest-neighbor search library) to find the top k neighbors of interest out of a huge database, which is very fast, and then to use the more accurate estimation (joint MLE) only for those top neighbors, because it is very expensive. This should solve our problem, being fast and also accurate for small Jaccard similarities. In theory, the sketches generated following the SetSketch locality-sensitivity section can also be used for the joint MLE, right (if the same parameters b, m, a and q are used), even though the estimator for the joint MLE is quite different?
We do not want to use MinHash or the faster alternatives alone, because space is also a problem for us; MinHash-like sketches such as One Permutation MinHash consume much more memory than SetSketch in practice. A rough sketch of the two-stage idea is below.
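Roughly what I have in mind, as a Python sketch (`ann_index.search` and `joint_mle_jaccard` are hypothetical placeholders for the ANN library and the joint MLE estimator, not real APIs):

```python
def two_stage_search(query_sketch, ann_index, k=10):
    # Stage 1: fast approximate top-k retrieval over the whole database using
    # the LSH-style SetSketch comparison (cheap, but noisy for small J).
    # `ann_index.search` stands in for the graph-based ANN library.
    candidates = ann_index.search(query_sketch, k)

    # Stage 2: re-rank only these k candidates with the expensive but more
    # accurate joint MLE (`joint_mle_jaccard` is a placeholder). Both stages
    # must use sketches built with the same parameters b, m, a and q.
    rescored = [(c, joint_mle_jaccard(query_sketch, c.sketch)) for c in candidates]
    return sorted(rescored, key=lambda t: t[1], reverse=True)
```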
Thanks,
Jianshu
Hello Otmar,
I am a little bit confused about the mergeability of the various algorithms mentioned in the SetSketch paper. Theoretically, MinHash is mergeable, right, and also HLL/SetSketch. With the one permutation and densification ideas applied, is it still mergeable, given that the densification introduces randomness (hashing the entire document vs. hashing several parts of it and then merging)? I am not sure about this. This is related to whether we can use it in a streaming fashion, e.g. for thousands of genomes (AGCT text documents): if we want to hash an entire genome, we need to wait until the whole genome is loaded into memory. If it is mergeable, we can hash while reading without waiting, and the hashed parts can then be released from memory. This would save a lot of memory, since genome files can be large (5 megabytes) and there are millions of them.
Thanks,
Jianshu
Hi Jianshu,
you should not merge after densification. However, you can use one permutation hashing without densification for the partial results. Then you can merge the partial results and only apply densification on the final result.
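For illustration, a minimal Python sketch of this workflow, using the simple one-sided densification of Shrivastava & Li instead of optimal densification, and Python's built-in hash just for the demo:

```python
M = 256  # number of bins (m)

def partial_sketch(elements, m=M):
    # One permutation hashing WITHOUT densification: hash each element once,
    # derive a bin index and a value, and keep the minimum value per bin.
    # Empty bins stay None, which keeps the partial sketch mergeable.
    sketch = [None] * m
    for x in elements:
        h = hash(x) & 0xFFFFFFFFFFFFFFFF  # demo hash only; use a proper 64-bit hash
        b, v = h % m, h // m
        if sketch[b] is None or v < sketch[b]:
            sketch[b] = v
    return sketch

def merge(s1, s2):
    # Merging partial sketches is just an element-wise minimum (None = empty).
    return [min((v for v in pair if v is not None), default=None)
            for pair in zip(s1, s2)]

def densify(sketch):
    # Apply densification ONCE, on the final merged sketch: each empty bin
    # borrows the value of the nearest non-empty bin to its right (circular).
    # Simplified one-sided densification; the original scheme also adds an
    # offset to the borrowed value to preserve the LSH property.
    assert any(v is not None for v in sketch), "all bins empty"
    m, out = len(sketch), list(sketch)
    for i in range(m):
        j = i
        while out[i] is None:
            j = (j + 1) % m
            out[i] = sketch[j]
    return out

# Stream genome chunks, sketch each while reading, merge, densify at the end:
s = merge(partial_sketch("ACGTACGT"), partial_sketch("TTGACCA"))
final = densify(s)
```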
Otmar
OK, thanks, this is very helpful. I will definitely acknowledge you and use your suggestions.
Thanks,
Jianshu
Hello Otmar,
I am wondering whether the new HyperLogLogLog idea described here: https://dl.acm.org/doi/abs/10.1145/3534678.3539246 could be applied to SetSketch/the MLE in HLL (when b=1.001). After a quick reading, it seems they just compress the sketches further and allow overflows, but handle the overflows very carefully. Space is even smaller with this idea (40% smaller). I have not read the code carefully, but it seems all HyperLogLog properties are preserved.
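As far as I understand the core trick, it is roughly the following (my own toy Python sketch, not the paper's actual algorithm): store the m registers as small offsets from a shared base, and keep the few overflowing registers in a sparse exception list.

```python
def offset_compress(registers, offset_bits=3, idx_bits=12, reg_bits=6):
    # Toy version of the HyperLogLogLog storage idea: choose a common base
    # minimizing total size, store each register as an `offset_bits`-bit
    # offset from it, and keep registers that overflow this range in a
    # sparse list of (index, value) pairs.
    m = len(registers)
    hi = 2 ** offset_bits  # representable range: base .. base + hi - 1

    def total_bits(base):
        exceptions = sum(1 for r in registers if not base <= r < base + hi)
        return m * offset_bits + exceptions * (idx_bits + reg_bits)

    base = min(range(max(registers) + 1), key=total_bits)
    dense = [min(max(r - base, 0), hi - 1) for r in registers]
    sparse = [(i, r) for i, r in enumerate(registers) if not base <= r < base + hi]
    return base, dense, sparse, total_bits(base)

# Example: one large register becomes an exception, the rest stay dense.
base, dense, sparse, bits = offset_compress([5, 6, 7, 6, 23, 7, 5, 6])
```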
Looking forward to your reply.
Thanks,
Jianshu