ekzhu / SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop
Apache License 2.0

SetSimilaritySearch on Scale #5

Open variux opened 4 years ago

variux commented 4 years ago

Is it possible to integrate with Redis or Cassandra as a storage layer, as MinHash LSH already does?

ekzhu commented 4 years ago

Integrating with Redis or another external storage layer is definitely possible. However, I would be concerned about the I/O cost of external storage: the original sets and the posting lists (the data structure used in this library) can be much bigger than MinHash and LSH, so a Python compute layer on top of a Redis/Cassandra storage layer may be inefficient because of the large number of I/Os. A more efficient implementation would need to take these costs into account, which adds a lot of complexity. I do have an algorithm that solves this problem (JOSIE, VLDB 2019, GitHub), but I haven't had time to write a production-ready library for it.
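To make the I/O concern concrete, here is a minimal sketch of posting lists in plain Python. It is not this library's implementation and does not use its API: the helper names `build_posting_lists` and `query_overlaps` are hypothetical, and the sketch assumes each posting list would live under its own key in an external store, so every list read costs one round trip.

```python
from collections import defaultdict


def build_posting_lists(sets):
    """Map each token to the list of IDs of the sets that contain it."""
    index = defaultdict(list)
    for set_id, tokens in enumerate(sets):
        for token in set(tokens):
            index[token].append(set_id)
    return dict(index)


def query_overlaps(index, query):
    """Count overlaps with every indexed set and track how much was read.

    Returns ({set_id: overlap}, lists_read, entries_read).
    """
    overlaps = defaultdict(int)
    lists_read = 0
    entries_read = 0
    for token in set(query):
        posting = index.get(token)
        if posting is None:
            continue
        lists_read += 1               # assumed: one round trip per posting list
        entries_read += len(posting)  # data transferred grows with list length
        for set_id in posting:
            overlaps[set_id] += 1
    return dict(overlaps), lists_read, entries_read


sets = [
    ["a", "b", "c", "d"],
    ["b", "c", "e"],
    ["a", "f"],
]
index = build_posting_lists(sets)
overlaps, lists_read, entries_read = query_overlaps(index, ["a", "b", "c"])
print(overlaps, lists_read, entries_read)
```

In this toy example the index stores one entry per (token, set) pair, so its size grows with the total size of the sets, and a query touches one posting list per query token. A MinHash signature, by contrast, has a fixed length per set regardless of how large the set is, which is why the same external-storage design can be much more I/O-heavy here.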