allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

EntityLinker import hangs #520

Closed mezarque closed 1 month ago

mezarque commented 3 months ago

I've been trying to import EntityLinker but running into an unusual issue where the kernel hangs for a very long time (so far I've let it run up to 93 minutes) without dying or producing an error.

I know there are some previous issues that were related to nmslib (e.g. #365, #372, #437, #446). These seemed to result in a zsh: illegal hardware instruction error, which I don't seem to be encountering.

I eventually figured out how to resolve this, but wanted to share my solution, in case anyone else runs into the same problem.

Hardware / OS

I'm using a 2021 MacBook Pro with an Apple M1 Pro chip, running macOS Ventura 13.1.

Steps

  1. Create a conda environment using conda create -n scispacy python=3.9. I'm using conda 24.7.1.
  2. conda activate scispacy
  3. pip install scispacy
  4. pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
  5. Start an interactive Python session with python
  6. Run the following code:

    import spacy
    nlp = spacy.load("en_core_sci_sm")

    Receive warning:

    /Users/dennis/miniconda3/envs/scispacy/lib/python3.9/site-packages/spacy/language.py:2195: FutureWarning: Possible set union at position 6328
      deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(  # type: ignore[union-attr]
  7. Continue with

    doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")

    No problems.

  8. Run the following code:

    import spacy
    
    from scispacy.abbreviation import AbbreviationDetector
    
    nlp = spacy.load("en_core_sci_sm")
    
    # Add the abbreviation pipe to the spacy pipeline.
    nlp.add_pipe("abbreviation_detector")
    
    # Adjacent string literals are concatenated, so no indentation leaks into the text.
    doc = nlp(
        "Spinal and bulbar muscular atrophy (SBMA) is an "
        "inherited motor neuron disease caused by the expansion "
        "of a polyglutamine tract within the androgen receptor (AR). "
        "SBMA can be caused by this easily."
    )
    
    print("Abbreviation", "\t", "Definition")
    for abrv in doc._.abbreviations:
        print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

    No problems.

  9. Run the following code:

    import scispacy

    No problems.

  10. Run the following code:

    from scispacy.linking import EntityLinker

    Kernel hangs for a very long time without dying.
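
For anyone trying to confirm the same hang without tying up a session, here is a minimal sketch of one way to probe it. It runs a statement in a fresh interpreter and uses the standard-library faulthandler module to dump a traceback and exit if the statement has not finished in time. The helper name probe_statement and the timeout values are illustrative assumptions, not part of scispacy.

```python
import subprocess
import sys
from typing import Tuple


def probe_statement(statement: str, timeout_s: float) -> Tuple[bool, str]:
    """Run `statement` in a fresh interpreter and return (completed, stderr).

    `completed` is True only if the child exited with status 0. If the
    statement is still running after `timeout_s` seconds, faulthandler
    writes a traceback of every thread to stderr and terminates the child.
    """
    child_code = (
        "import faulthandler\n"
        # Watchdog: dump tracebacks and exit if we are still running later.
        f"faulthandler.dump_traceback_later({timeout_s!r}, exit=True)\n"
        f"{statement}\n"
        "faulthandler.cancel_dump_traceback_later()\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            # Safety net in case the child never gets to exit on its own.
            timeout=timeout_s + 30,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stderr
```

Calling probe_statement("from scispacy.linking import EntityLinker", 120.0) in the environment from the steps above should then return within a couple of minutes either way. A False result with a traceback in the second element would show where the import was stuck, while a False result with an ordinary exception points to a different problem.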

Attempts

  1. I've encountered the same behavior in an interactive Python session, as well as when running the code within a Jupyter notebook.
  2. I tried uninstalling nmslib with pip uninstall nmslib and reinstalling with each of the following strategies:
    • pip install --no-binary :all: nmslib (suggested here)
    • CFLAGS="-mavx -DWARN(a)=(a)" pip install nmslib (suggested here)

Solution

Installing nmslib using conda (I used mamba) appeared to solve the issue.

mamba install nmslib

This installed nmslib 2.1.1, which appears to be a newer version than what is specified in requirements.in and setup.py (nmslib>=1.7.3.6). Might upgrading the version there be a good idea? I'm not sure what other issues that would introduce.
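
To check which nmslib version an environment actually ended up with, something like the following sketch may help. It reads the installed version through the standard-library importlib.metadata and compares it with the 1.7.3.6 minimum mentioned above. The helper names are hypothetical, the comparison only handles purely numeric dotted versions, and it assumes the package was installed with metadata that importlib.metadata can see.

```python
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if no metadata is found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def numeric_key(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '1.7.3.6' into a comparable tuple.

    Simplification: non-numeric parts (e.g. 'rc1') are ignored.
    """
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def meets_minimum(package: str, minimum: str) -> Optional[bool]:
    """Return True/False for installed packages, None if not installed."""
    found = installed_version(package)
    if found is None:
        return None
    return numeric_key(found) >= numeric_key(minimum)
```

For example, meets_minimum("nmslib", "1.7.3.6") returns None when nmslib is not installed, and installed_version("nmslib") shows whether the environment has 2.1.1 or an older release.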

dakinggg commented 2 months ago

Wow, thank you! This solution seems to work for me on both Windows and Linux with Python 3.11, which hasn't previously worked. Thank you for sharing! I will respond to some other issues to see if it works for others, and then update the installation instructions.

dakinggg commented 1 month ago

I just added a support matrix based on what I'm able to test or glean from previous GitHub issues, so I'm going to go ahead and close this issue. Thanks again for the suggestion!