allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.72k stars 229 forks source link

Convert nmslib to annoy #481

Closed nanthony007 closed 1 year ago

nanthony007 commented 1 year ago

This PR removes the dependency on nmslib and switches to annoy by Spotify link which appears to be more actively maintained and is a C++ library. All of the changes happen inside scispacy/candidate_generation.py since, as far as I could tell, that is where nmslib was used.

Important: Due to nmslib not working on my personal computer and me not knowing how to evaluate these changes and the requirements (i.e. generating UMLS knowledge base, indices as are, new indices, and some evaluation metric) I have not actually run this code so it quite likely breaks. I wanted to move beyond an issue since I really like this project and hope to use it at UK and I think this PR serves as a proof of concept for how it may not be that complex to switch off of nmslib... then again I could be completely wrong 😄

A few important notes on annoy vs nmslib:

Finally, as a general note, other that the specific nmslib usages I tried to leave the code as much unchanged as possible although it is now possible that certain checks/crashes may not occur.

Please advise on next steps, I am willing to test this on a different device (allowing nmslib install) but would need guidance on evaluation procedures/metrics.

I hope this PR succinctly shows that switching (at least) may be possible without too much overhead and that would open up scispacy for usage on more platforms with more python versions.

dakinggg commented 1 year ago

Hi @nanthony007, thanks for making a start at this! If you're interested, I'm happy to help you through all the steps that would be necessary in order to fully replace nmslib with annoy, but there are a few :) Basically it comes down to rebuilding the indexes and then testing them. To rebuild the indexes, you would need to get the create linker script to run and produce the necessary artifacts. The input paths for that script can be found here. Once you have the linkers created, we would need to test a few things.

  1. The accuracy is roughly similar to the accuracy using nmslib. You can check that by running the evaluation script. The accuracy should be roughly as below for the main UMLS linker. Note: running the eval script (and development in general) will require that you install scispacy from source with pip install -e ..
    Total entities:  70262
    Correct at 1:  39170 Recall at 1:  0.5574848424468418
    Correct at 2:  46132 Recall at 2:  0.6565711195240671
    Correct at 10:  53370 Recall at 10:  0.759585551222567
    Correct at 40:  55821 Recall at 40:  0.7944692721527995
  2. The speed is roughly similar to nmslib. I haven't written a speed benchmarking script for the linker, but it should be fairly straightforward, and I can measure the speed of nmslib for the new method if needed.
  3. The file size is roughly similar to nmslib. If annoy produces files that are much larger than the files that nmslib produces, that might be a problem. I don't have any reason to think it would, just something to check.

If we get through all of that, I can upload the new files to s3, and do a new release that removes the nmslib dependency.

dakinggg commented 1 year ago

btw, none of those steps should require installing nmslib. You can edit the setup.py to remove the nmslib and add the annoy depdendency, and then install from source with pip install -e .. I can do any benchmarking against the nmslib version that is necessary.

nanthony007 commented 1 year ago

So upon some quick trials, the annoy index was taking ~5hours to create (by adding items) on my computer, which would crash when trying to build. Turns out it results in ~10GB file on disk when only building for 50k sample of tfidf-vectors.

I think faiss may be another option that seems more similar in API at least to nmslib. Will try tomorrow.

dakinggg commented 1 year ago

One issue you may run into is that nmslib has specific support for a sparse cosine similarity, while annoy says it works up to ~1000 dimensions, and faiss says it is for dense vectors.

nanthony007 commented 1 year ago

Okay, given that, I suppose we would expect both of these solutions to fall short. Is there a reason why we did not just use scipy.sparse and something from sklearn? Not an expert on the implementation details so forgive me.

nanthony007 commented 1 year ago

If you don't see any immediate problems I could try NearestNeighbors from scikit learn? I know that is frequently used with tfidf vectors. Appears one of their metrics is cosine distance as well.

Edited NearestNeighbors

nanthony007 commented 1 year ago

Do you have an idea of how long evaluation script should take? That would give us a quick speed benchmark right? At least to know in the same general performance area....

dakinggg commented 1 year ago

I would expect approximate nearest neighbors via nmslib to be substantially faster, and better for high dimensional tfidf data than the sklearn exact nearest neighbors algorithm. Running the evaluation script on my machine took ~3 minutes.

nanthony007 commented 1 year ago

You would be right... I ran the evaluation and it took almost 12 hours, with only a slight recall degradation. It did only take ~2minutes to create the model as opposed to 2 hours to create the index.

CleanShot 2023-05-16 at 07 11 50@2x
nanthony007 commented 1 year ago

If you don't have any recommendations I think you can close this PR and its corresponding issue... #473.

I did have another question regarding replacing the current entity linker with spacy's built in knowledge base and entity linker that would thus remove the nmslib dependency as well but I was unsure of the viability of that at this time...

Thanks for all your help on this.

nanthony007 commented 1 year ago

There seems to be no further suggestions so I'm going to close this and the linked issue #473