allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Remove NMSLIB dependency #473

Closed. nanthony007 closed this issue 1 year ago

nanthony007 commented 1 year ago

I'm not sure whether this is possible or what alternatives exist, but due to years of inactivity and unresponsiveness on the primary nmslib maintainer's side (not faulting him), the nmslib dependency makes scispacy very inaccessible to new users. In fact, it will remain completely inaccessible to users on new operating systems (Windows 11) or running modern versions of Python (3.11).

Are there any possible alternatives for the few lines of code where this package uses nmslib?

From what I can see those are primarily two calls to nmslib.init() and otherwise type annotations.

Please advise. If possible, I would love to help here, but I am not comfortable writing robust production C++ code, nor am I an expert on the scispacy models themselves.

dakinggg commented 1 year ago

Hi @nanthony007, replacing nmslib with another approximate nearest neighbor search library is certainly doable, but is a bit more involved than you might realize. The candidate generator (https://github.com/allenai/scispacy/blob/4f9ba0931d216ddfb9a8f01334d76cfb662738ae/scispacy/candidate_generation.py#L148) uses nmslib for the approximate nearest neighbor search, so we would need to swap that out for another library. That means recreating the index with the new library (see https://github.com/allenai/scispacy/blob/4f9ba0931d216ddfb9a8f01334d76cfb662738ae/scispacy/candidate_generation.py#L365 for how it is built with nmslib), rewriting the code that loads and uses that index, and then evaluating the candidate generation to make sure speed and accuracy are still on par with the previous implementation. This is unfortunately not something I am likely to have time to do in the near future, but I will try.
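
The evaluation step mentioned above could be sketched roughly as follows. This is only an illustration and not scispacy's code: `exact_top_k` and `recall_at_k` are hypothetical helper names, the vectors are toy data, cosine distance is an assumed metric, and the "approximate" ids stand in for whatever a replacement library would return. Exact brute-force search provides the reference neighbors, and recall@k measures how many of them the approximate search recovers.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; a zero vector gets distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def exact_top_k(query: Sequence[float],
                vectors: Dict[int, Sequence[float]],
                k: int) -> List[int]:
    """Brute-force reference: ids of the k vectors closest to the query."""
    ranked = sorted(vectors, key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact neighbors that the approximate search returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data standing in for the real indexed vectors (hypothetical values).
vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
query = [1.0, 0.05]

exact = exact_top_k(query, vectors, k=2)
# Pretend a candidate ANN library returned these ids for the same query.
approx = [1, 2]
print(exact, recall_at_k(approx, exact))
```

Averaging this recall over a held-out set of queries, alongside timing the queries, would give one way to compare a new index against the existing nmslib one.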

That being said, I have recently installed nmslib successfully on Windows Subsystem for Linux with Python 3.10. Python 3.11 likely does not work, as you say.

nanthony007 commented 1 year ago

Ideally, I wanted to include scispacy as a dependency of a package that gives more novice programmers simple access to biomedical NER. Using WSL and/or navigating dependency versions (Python, scispacy, etc.) seems like mental overhead I want to avoid.

Is there a way this model could be re-trained using spacy's new entity linker itself? Could that accomplish the same NEL while benefiting from scispacy's models?

nanthony007 commented 1 year ago

I wonder if annoy could be a good fit for an alternative ANN index?

nanthony007 commented 1 year ago

Please see #481

nanthony007 commented 1 year ago

Closing due to no clear direction forward...

phaeta commented 1 year ago

@nanthony007 I was able to build scispacy for Python 3.11 by using the latest pybind11 (2.10.4) and building nmslib from the master branch, e.g.:

# in a clean virtual environment
pip install pybind11==2.10.4
pip install "nmslib @ git+https://github.com/nmslib/nmslib.git@ade4bcdc9dd3719990de2503871450b8a62df4a5/#subdirectory=python_bindings"
pip install scispacy
...

(ade4bcdc9dd3719990de2503871450b8a62df4a5 was the last commit to master, quite a while ago.)
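
When comparing results across machines, it may help to record the environment before and after trying the commands above. The snippet below is a minimal sketch using only the Python standard library; `describe_environment` is a hypothetical helper name, and it only checks whether an `nmslib` package can be found, not whether it actually works.

```python
import importlib.util
import platform
import sys


def describe_environment() -> dict:
    """Collect the facts that matter when reporting an nmslib build problem."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "os": platform.system(),        # e.g. "Darwin", "Linux", "Windows"
        "machine": platform.machine(),  # e.g. "x86_64", "arm64", "aarch64"
        # find_spec returns None when no importable nmslib package is found.
        "nmslib_installed": importlib.util.find_spec("nmslib") is not None,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Pasting this output alongside a build log makes it easier to tell which OS, CPU, and Python combinations succeed.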

dakinggg commented 1 year ago

Thanks @phaeta! Could you share what OS you are on?

phaeta commented 1 year ago

@dakinggg macOS Ventura

nanthony007 commented 1 year ago

Unfortunately I am unable to replicate this. Copying your git install command resulted in git not finding the revision. Upon removing the trailing "/" pip attempts to build the wheels and install but fails during the Clang build. @phaeta are you on M1 or Intel? Are you using conda python?
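
For reference, the requirement string without the trailing "/" can be assembled as in the sketch below. `nmslib_requirement` is a hypothetical helper, not part of pip or nmslib; it only formats the `git+<url>@<revision>#subdirectory=<dir>` form that the commands in this thread use, and it does not guarantee that the build itself succeeds.

```python
from typing import Optional

NMSLIB_REPO = "https://github.com/nmslib/nmslib.git"


def nmslib_requirement(revision: Optional[str] = None) -> str:
    """Build a pip direct-reference requirement for nmslib's Python bindings."""
    url = "git+" + NMSLIB_REPO
    if revision:
        # Strip stray slashes so the ref is exactly the commit hash or branch.
        url += "@" + revision.strip("/")
    return f"nmslib @ {url}#subdirectory=python_bindings"


# Pinned to the commit mentioned earlier in the thread.
print(nmslib_requirement("ade4bcdc9dd3719990de2503871450b8a62df4a5/"))
# Unpinned: whatever the repository's default branch currently points to.
print(nmslib_requirement())
```

The printed string would be passed to pip in quotes, e.g. `pip install "<string>"`.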

The build errors I am getting appear to be around SIMD and Scalars...

-std=c++14 -fvisibility=hidden
  ./similarity_search/src/distcomp_scalar.cc:85:9: error: pragma message requires parenthesized string
  #pragma message WARN("ScalarProductSIMD<float>: SSE2 is not available, defaulting to pure C++ implementation!")
          ^
  ./similarity_search/src/distcomp_scalar.cc:169:18: warning: explicit instantiation of 'NormScalarProductSIMD<float>' that occurs after an explicit specialization has no effect [-Winstantiation-after-specialization]
  template float   NormScalarProductSIMD<float>(const float* pVect1, const float* pVect2, size_t qty);
                   ^
  ./similarity_search/src/distcomp_scalar.cc:83:7: note: previous template specialization is here
  float NormScalarProductSIMD(const float* pVect1, const float* pVect2, size_t qty) {
        ^
  ./similarity_search/src/distcomp_scalar.cc:195:9: error: pragma message requires parenthesized string
  #pragma message WARN("ScalarProductSIMD<float>: SSE2 is not available, defaulting to pure C++ implementation!")
          ^
  ./similarity_search/src/distcomp_scalar.cc:246:18: warning: explicit instantiation of 'ScalarProductSIMD<float>' that occurs after an explicit specialization has no effect [-Winstantiation-after-specialization]
  template float   ScalarProductSIMD<float>(const float* pVect1, const float* pVect2, size_t qty);
                   ^
  ./similarity_search/src/distcomp_scalar.cc:193:7: note: previous template specialization is here
  float ScalarProductSIMD(const float* pVect1, const float* pVect2, size_t qty) {
        ^
  2 warnings and 2 errors generated.
  error: command '/usr/bin/clang' failed with exit code 1

phaeta commented 1 year ago

@nanthony007 Try this: pip install "nmslib @ git+https://github.com/nmslib/nmslib.git/#subdirectory=python_bindings"

Regarding architecture, I'm using an Intel Mac. I'm using python@3.11 from Homebrew. Also the master-branch nmslib build works for me on Linux (Ubuntu 20.04 (aarch64), Python 3.11 built from source).

I'll play around with this in a container and put together a Dockerfile. Also, I have access to an M1 Mac Mini; I'll try things there too. Stay tuned

nanthony007 commented 1 year ago

Okay thanks! That command also does not work so maybe it's something with M1? My main concern is M1 and Windows 11 support since I think most students will likely be on those platforms.
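
One way to check the M1 hypothesis is to look at the CPU architecture Python reports. The Clang log earlier in the thread mentions that SSE2 is not available, which would be consistent with a non-x86 CPU such as Apple Silicon, although nothing here confirms that as the cause. The sketch below uses only the standard library; `is_x86` and the `X86_MACHINES` set are hypothetical names, and the set is an assumption about common `platform.machine()` values rather than an exhaustive list.

```python
import platform

# Common machine strings for x86-family CPUs (assumed, not exhaustive).
X86_MACHINES = {"x86_64", "amd64", "i386", "i686", "x86"}


def is_x86(machine: str) -> bool:
    """True if the machine string names an x86-family CPU."""
    return machine.lower() in X86_MACHINES


machine = platform.machine()
if is_x86(machine):
    print(f"{machine}: x86 CPU, so the SSE2 code path is likely available")
else:
    print(f"{machine}: not x86, so expect the non-SSE2 fallback when compiling")
```

Note that a Python interpreter running under Rosetta on an M1 Mac may report `x86_64`, so the result describes the interpreter's architecture rather than the hardware itself.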

fsecada01 commented 11 months ago

> @nanthony007 Try this: pip install "nmslib @ git+https://github.com/nmslib/nmslib.git/#subdirectory=python_bindings"
>
> Regarding architecture, I'm using an Intel Mac. I'm using python@3.11 from Homebrew. Also the master-branch nmslib build works for me on Linux (Ubuntu 20.04 (aarch64), Python 3.11 built from source).
>
> I'll play around with this in a container and put together a Dockerfile. Also, I have access to an M1 Mac Mini; I'll try things there too. Stay tuned

I can confirm that this works for Windows 11 and Python 3.11.

umayerr commented 1 month ago

> > @nanthony007 Try this: pip install "nmslib @ git+https://github.com/nmslib/nmslib.git/#subdirectory=python_bindings"
> >
> > Regarding architecture, I'm using an Intel Mac. I'm using python@3.11 from Homebrew. Also the master-branch nmslib build works for me on Linux (Ubuntu 20.04 (aarch64), Python 3.11 built from source).
> >
> > I'll play around with this in a container and put together a Dockerfile. Also, I have access to an M1 Mac Mini; I'll try things there too. Stay tuned
>
> I can confirm that this works for Windows 11 and Python 3.11.

Unfortunately, this solution isn't working for my Intel machine. I'm running Debian 12 with Python 3.11. Has anyone tested this on Linux?

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> nmslib