ekzhu / SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop
Apache License 2.0
588 stars 40 forks source link

Very slow performance on large sets #11

Open briannadon opened 2 years ago

briannadon commented 2 years ago

Hi, this package looks really cool and I'd love to use it for my use case.

I have about 7,000 sets with about 1,000 elements each that I'm using as my index. I also have a set of about 1,000 queries with similar sizes, about 1,000 elements each, as queries. However, when I profile the times for queries for this package vs. datasketch's MinHashLSHEnsemble method, the results are pretty wildly off-base from the numbers presented in the readme.

In general, a single minhash LSH ensemble query in my case is taking about 10ms, and the SetSimilarity query is taking anywhere from 300ms to 500ms, even whole seconds in some cases. Are these numbers to be expected, and is SetSimilaritySearch simply not suitable for sets this large? My sets are exclusively integers, if that matters.

Any insight or help is appreciated.

ekzhu commented 2 years ago

You are correct that this package is not suitable comparing to datasketch when it comes to larger sets. The benchmark datasets used in README have average set size around 20-30.

For this package, the query time is directly proportional to the size of the query set (# of tokens). It is also heavily influenced by the size of indexed sets because exact set similarity calculation is made at query time for candidate sets.

There are some algorithmic tricks that are designed to handle exact search over large sets. I made one: https://github.com/ekzhu/josie. I haven't had time to make it available here.