Fix removing tokens that appear in query file but not index file from query sets

ekzhu / SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop

Apache License 2.0


Fix removing tokens that appear in query file but not index file from query sets #15

Closed: innovate-invent closed this pull request 2 years ago

innovate-invent commented 2 years ago

I believe this is an effective fix, but I am not entirely sure what the consequences of using negative indices are.

Resolves #13

ekzhu commented 2 years ago

Thanks for the pull request. I think a more robust solution may be to modify the similarity function so that it takes the set sizes as additional arguments. That way the size used in the computation does not have to equal the number of tokens passed into the function; e.g., we can use the actual query set size rather than the size of the subset of query tokens that exist in the index.
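A minimal sketch of that idea, for illustration only. The function name, the `size1`/`size2` parameters, and the toy data are assumptions made for this example, not the library's actual signature or the change that was merged. It shows how dropping query tokens that are missing from the index inflates the score, and how passing the original query size corrects it.

```python
def jaccard_with_sizes(s1, s2, size1=None, size2=None):
    """Jaccard similarity with optional explicit set sizes.

    Hypothetical helper: size1/size2 default to the lengths of the
    sets actually passed in, but a caller may override them.
    """
    size1 = len(s1) if size1 is None else size1
    size2 = len(s2) if size2 is None else size2
    overlap = len(set(s1) & set(s2))
    union = size1 + size2 - overlap
    return overlap / union if union else 0.0


# Toy data: the index knows only the tokens a, b, c.
index_set = {"a", "b", "c"}
vocabulary = set(index_set)

# The query contains two tokens (x, y) that never appear in the index.
query = {"a", "b", "x", "y"}
filtered_query = query & vocabulary  # x and y are dropped

# Using the filtered size (2) overstates the similarity.
inflated = jaccard_with_sizes(filtered_query, index_set)

# Passing the original query size (4) gives the true Jaccard value.
corrected = jaccard_with_sizes(filtered_query, index_set, size1=len(query))

print(inflated, corrected)
```

Here `corrected` equals the Jaccard similarity of the unfiltered query against the indexed set, because tokens absent from the index can never contribute to the overlap but still count toward the query size.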

ekzhu commented 2 years ago

I made the required changes. Can you help me verify if the changes are correct by adding a unit test for your scenario? Thanks!

innovate-invent commented 2 years ago

I ran the test on master and on this branch: it fails on master and passes here.