Closed nanthony007 closed 1 year ago
Hi @nanthony007, thanks for making a start at this! If you're interested, I'm happy to help you through all the steps that would be necessary in order to fully replace nmslib with annoy, but there are a few :) Basically it comes down to rebuilding the indexes and then testing them. To rebuild the indexes, you would need to get the create linker script to run and produce the necessary artifacts. The input paths for that script can be found here. Once you have the linkers created, we would need to test a few things.
pip install -e .
.
Total entities: 70262
Correct at 1: 39170 Recall at 1: 0.5574848424468418
Correct at 2: 46132 Recall at 2: 0.6565711195240671
Correct at 10: 53370 Recall at 10: 0.759585551222567
Correct at 40: 55821 Recall at 40: 0.7944692721527995
If we get through all of that, I can upload the new files to s3, and do a new release that removes the nmslib dependency.
btw, none of those steps should require installing nmslib. You can edit the setup.py
to remove the nmslib and add the annoy depdendency, and then install from source with pip install -e .
. I can do any benchmarking against the nmslib version that is necessary.
So upon some quick trials, the annoy index was taking ~5hours to create (by adding items) on my computer, which would crash when trying to build. Turns out it results in ~10GB file on disk when only building for 50k sample of tfidf-vectors.
I think faiss may be another option that seems more similar in API at least to nmslib. Will try tomorrow.
One issue you may run into is that nmslib
has specific support for a sparse cosine similarity, while annoy says it works up to ~1000 dimensions, and faiss says it is for dense vectors.
Okay, given that, I suppose we would expect both of these solutions to fall short. Is there a reason why we did not just use scipy.sparse
and something from sklearn? Not an expert on the implementation details so forgive me.
If you don't see any immediate problems I could try NearestNeighbors from scikit learn? I know that is frequently used with tfidf vectors. Appears one of their metrics is cosine distance as well.
Edited NearestNeighbors
Do you have an idea of how long evaluation script should take? That would give us a quick speed benchmark right? At least to know in the same general performance area....
I would expect approximate nearest neighbors via nmslib to be substantially faster, and better for high dimensional tfidf data than the sklearn exact nearest neighbors algorithm. Running the evaluation script on my machine took ~3 minutes.
You would be right... I ran the evaluation and it took almost 12 hours, with only a slight recall degradation. It did only take ~2minutes to create the model as opposed to 2 hours to create the index.
If you don't have any recommendations I think you can close this PR and its corresponding issue... #473.
I did have another question regarding replacing the current entity linker with spacy's built in knowledge base and entity linker that would thus remove the nmslib dependency as well but I was unsure of the viability of that at this time...
Thanks for all your help on this.
There seems to be no further suggestions so I'm going to close this and the linked issue #473
This PR removes the dependency on
nmslib
and switches toannoy
by Spotify link which appears to be more actively maintained and is a C++ library. All of the changes happen insidescispacy/candidate_generation.py
since, as far as I could tell, that is wherenmslib
was used.Important: Due to
nmslib
not working on my personal computer and me not knowing how to evaluate these changes and the requirements (i.e. generating UMLS knowledge base, indices as are, new indices, and some evaluation metric) I have not actually run this code so it quite likely breaks. I wanted to move beyond an issue since I really like this project and hope to use it at UK and I think this PR serves as a proof of concept for how it may not be that complex to switch off ofnmslib
... then again I could be completely wrong 😄A few important notes on
annoy
vsnmslib
:n_trees
at index build time)k_search
at search/run-time as configurable parametersFinally, as a general note, other that the specific
nmslib
usages I tried to leave the code as much unchanged as possible although it is now possible that certain checks/crashes may not occur.Please advise on next steps, I am willing to test this on a different device (allowing
nmslib
install) but would need guidance on evaluation procedures/metrics.I hope this PR succinctly shows that switching (at least) may be possible without too much overhead and that would open up
scispacy
for usage on more platforms with more python versions.