Very slow performance on large sets

ekzhu / SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop

Apache License 2.0

588 stars 40 forks source link

Hi, this package looks really cool and I'd love to use it for my use case.

I have about 7,000 sets with about 1,000 elements each that I'm using as my index. I also have a set of about 1,000 queries with similar sizes, about 1,000 elements each, as queries. However, when I profile the times for queries for this package vs. datasketch's MinHashLSHEnsemble method, the results are pretty wildly off-base from the numbers presented in the readme.

In general, a single minhash LSH ensemble query in my case is taking about 10ms, and the SetSimilarity query is taking anywhere from 300ms to 500ms, even whole seconds in some cases. Are these numbers to be expected, and is SetSimilaritySearch simply not suitable for sets this large? My sets are exclusively integers, if that matters.

Any insight or help is appreciated.

ekzhu / SetSimilaritySearch

Very slow performance on large sets #11