dell-research-harvard / linktransformer

A convenient way to link, deduplicate, aggregate and cluster data(frames) in Python using deep learning
https://linktransformer.github.io/
GNU General Public License v3.0

Suggestion to implement range_search #15

Open pjlambert opened 8 months ago

pjlambert commented 8 months ago

Hi All, again - wonderful package and just terrific work.

One possible extension you might one day consider would be using FAISS's range_search function, instead of search (see https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes#range-search). This would allow for a "many-to-many" match in the more traditional sense, perhaps aligning the behaviour of the LT package to prior fuzzy matching packages.
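
To make the difference concrete, here is a minimal sketch in plain NumPy rather than FAISS, so it is a brute-force stand-in and not how the package or FAISS actually does it. The function names and the toy vectors are made up for illustration, and the embeddings are assumed to be unit-normalised so that the inner product is the cosine similarity.

```python
import numpy as np

def top1_matches(queries, corpus):
    """Nearest-neighbour style: exactly one corpus row per query."""
    sims = queries @ corpus.T  # cosine similarity for unit-norm rows
    return sims.argmax(axis=1)

def range_matches(queries, corpus, threshold):
    """Range style: every (query, corpus) pair with similarity > threshold."""
    sims = queries @ corpus.T
    q_idx, c_idx = np.nonzero(sims > threshold)
    return list(zip(q_idx.tolist(), c_idx.tolist()))

# Toy unit-norm "embeddings" (hypothetical values, not real model output).
corpus = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
queries = np.array([[1.0, 0.0], [0.6, 0.8]])

print(top1_matches(queries, corpus))        # one corpus index per query
print(range_matches(queries, corpus, 0.5))  # possibly several pairs per query
```

With a fixed `k`, each query gets the same number of neighbours whether or not they are good; with a threshold, the number of matches per query varies, which is the "many-to-many" behaviour I have in mind.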

The main drawback is that it is not GPU-friendly, but in my experience it works pretty efficiently on CPUs.

FWIW, my use-case is to match the universe of job postings to DnB establishments. I use range_search along with your firm-name embeddings to build a dataset of all pairwise matches above a pretty low similarity threshold (0.5). This gives me a huge set of potential matches. I then run an expectation-maximisation algorithm that considers both the similarity scores and other structured covariates (not necessarily exact matching criteria) such as industry codes, location distance, etc. to resolve the best match from this candidate set.
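
For the candidate-set step, FAISS's `range_search` returns a flat `(lims, D, I)` triple, where the hits for query `i` sit in `I[lims[i]:lims[i+1]]` with scores in the same slice of `D`. A rough sketch of turning that into a pairwise table is below; the helper name `range_result_to_pairs` and the hand-written arrays are hypothetical, and the expectation-maximisation step is not shown.

```python
import numpy as np

def range_result_to_pairs(lims, D, I):
    """Flatten a (lims, D, I) range-search result into (query, match, score) rows."""
    # Cast to a signed type in case lims arrives as an unsigned integer array.
    lims = np.asarray(lims, dtype=np.int64)
    counts = np.diff(lims)  # number of hits per query
    query_ids = np.repeat(np.arange(len(counts)), counts)
    return [(int(q), int(m), float(s)) for q, m, s in zip(query_ids, I, D)]

# Hand-written example in the (lims, D, I) layout: 3 queries, 3 hits in total.
# Query 0 has two hits, query 1 has none, query 2 has one.
lims = np.array([0, 2, 2, 3])
D = np.array([0.91, 0.62, 0.55])  # made-up similarity scores
I = np.array([4, 7, 1])           # made-up corpus row ids

pairs = range_result_to_pairs(lims, D, I)
for query_id, match_id, score in pairs:
    print(query_id, match_id, score)
```

Queries with no hit above the threshold simply contribute no rows, so the output is a long table of candidate pairs that later stages can score with other covariates.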

I would be happy to help implement this one day, if you feel it's something you would want to pursue.

Thanks again for all the great work, it's hugely appreciated by many!

econabhishek commented 8 months ago

Thanks for the wonderful suggestion, Peter. We too have tried range search on other projects and found it to be great.

Re: only CPU support, it's not a problem with the current version of the package - it only uses CPU FAISS (primarily because of dependency issues). Feel free to create a pull request for this. We are going to update the package soon (along with the paper and the models - we found a way to increase the amount of data available to us), so if you haven't made a request by then, I can implement it around mid-March.

We are thinking of creating a GPU-only branch for more scaled-up applications. It would not be offered as a pip package, primarily because dependency management is a bit messed up between FAISS GPU and the other required packages (pip install X doesn't work well).

I am glad that the package is working well for you. Hopefully we'll get close to a version 1.x.x soon.

Abhishek