SetSketch implementations

nvanva commented 1 year ago

Hi! Thank you for this wonderful library. I am working on estimation of the overlap between different web crawls, this basically requires estimating the number of unique URLs in lists of 10-100 billion URLs and the cardinalities of their intersections. After some literature search it seems that SetSketch is the right method for this case as it allows both cardinality and Jaccard index estimation. I found several implementations of SetSketch in your library (ByteSetSketch, CSetSketch, FSetSketch, ShortSetSketch). Could you please give an advice how to select the appropriate one? Also I could not find how I can change the hyperparameters a,b from Python. Is it possible and reasonable to try selecting them, or better rely on the default values? The final question is about the calculation of the confidence intervals for the estimates. In the implementation of HLL there is the method relative_error() to get those, is there a way to get similar estimates for SetSketch?

dnbaker commented 1 year ago

Hi -

The SetSketch is a great option. I think it's the best way to sketch and compare unweighted sets.

In this library, we provide two implementations:

CSetSketch, which uses full floating-point registers without any approximation. (C stands for Continuous)
SetSketch, which uses quantized log-transformed floating-point values stored in integers.

The simplest approach is just to use CSetSketch directly, which is fast and is the most accurate.

If you want the additional speed that can come from using SetSketch by packing more data into fewer bytes, then you have two options.

You can choose a, b a priori, risking the possibility of one of your values lying outside of the range of values you're using. This works, and it lets you sketch in a smaller space rather than the 64-bit floating points. For instance, ByteSetS stores 8 bits per register instead.
You can compute a number of CSetSketch options, calculate the minimum and maximum values sampled, and then select a and b to minimize the approximation used.

For 1, we have some defaults for different sizes (ByteSetS, ShortSetS, UintSetS) which work well on plenty of applications. You could just run with that.

For 2, you would use optimal_parameters function, called on CSetSketch, to get the a and b parameters to use. Then for each CSetSketch, you would use the .to_setsketch()function with the a and b values selected to generate the compressed sketches. This allows you to maximize the use of the integral representations.

2 takes more space a priori but yields higher-accuracy comparisons. 1 takes less space and time, but is slightly less accurate.

From Python you have fewer options, at least at construction.

I suggest the e[bs]s_from_np functions. ess uses shorts (uint16) and ebs uses bytes (uint8). They take numpy arrays of uint32/uint64 items, and they come with the option to select the a and b parameters used.

It would be helpful if more was exposed to the python interface; when I get time, I will expose the to_setsketch function so you can build up a sketch using CSetSketch and then quantize it to a ByteSetSketch or ShortSetSketch.

And lastly for relative error, it would be helpful to add it.

The formula is:

sqrt{(ln(b) * (b + 1) / (b - 1)  - 1) / m}

For CSetSketch (b = 1), it's simply 1/sqrt(m). Then it increases slightly with higher b values.

I'll add this functionality to a future edition as well.

Thanks for the question, and don't hesitate to ask any more you might have!

Best,

Daniel

nvanva commented 1 year ago

Thanks a lot for your recommendations. I will use CSetSketch because currently it seems to me that the space is not an issue, but precision is important. I'm still wondering about the following questions: 1) How is CSetSketch related to the SetSketch paper (https://vldb.org/pvldb/vol14/p2244-ertl.pdf)? Or is it described in another paper? I could roughly match the code of SetSketch with the pseudo-code in the paper, but CSetSketch looks a bit differently. 2) You mentioned that the relative error for CSetSketch is 1/sqrt(m). If I understand correctly, this is for cardinality estimation and not jaccard estimation. As I undertood from the paper for jaccard the relative error should be similar or better to the error of MinHash: sqrt(J(1-J)/m). When using CSetSketch, can I use sqrt(J(1-J)/m) as an upper bound of the relative error?

dnbaker commented 1 year ago

The CSetSketch is the un-truncated SetSketch. If you set b = 1, you'll recover it. It's simpler to compute since it doesn't need to compute as many thresholds. I found lazy truncation gave me more flexibility in our genomic applications in Dashing2.

CSetSketch is a MinHash with independent registers that can be computed efficiently thanks to early stopping. You can just use the minhash bounds directly. Its advantage is rapid computation compared to standard K-Mins, faster comparison than bottom-k hashing. The independent registers are also more powerful than bottom-k for LSH index construction because you can group them for stronger hash functions.

dnbaker / sketch

SetSketch implementations #73