allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

rxnorm linker doesn't work with multiprocessing? #345

Open kpich opened 3 years ago

kpich commented 3 years ago

Hi, I'm getting an error when trying to run nlp.pipe with n_process > 1. I think the pickling that multiprocessing does under the hood interacts poorly with nmslib.dist.FloatIndex, which the rxnorm entity linker requires and which does not appear to be picklable.

Minimal code:

import spacy
import scispacy
from scispacy.linking import EntityLinker  # importing this registers the "scispacy_linker" factory

TEXTS = ["Hello! This is document 1.", "And here's doc 2."]

if __name__ == '__main__':
  nlp = spacy.load("en_core_sci_sm")
  # The linker pipe is what pulls the nmslib index into the pipeline.
  nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True,
                                          "linker_name": "rxnorm"})
  for doc in nlp.pipe(TEXTS, n_process=2):
    print(doc)

Running with Python 3.8.5 gives me:

Traceback (most recent call last):
  File "./mwerror.py", line 13, in <module>
    for doc in nlp.pipe(TEXTS, n_process=2):
  File ".../python3.8/site-packages/spacy/language.py", line 1479, in pipe
    for doc in docs:
  File ".../python3.8/site-packages/spacy/language.py", line 1515, in _multiprocessing_pipe
    proc.start()
  File ".../python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File ".../python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File ".../python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File ".../python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File ".../python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File ".../python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File ".../python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'nmslib.dist.FloatIndex' object

Note that I don't get an error with n_process=1, presumably because multiprocessing is never invoked.

I also do not get this error if I don't include the linker pipe (i.e. comment out the add_pipe() line above).

Thanks! This lib is great!

kpich commented 3 years ago

Hey, it seems to work as expected (i.e. doesn't crash) on Linux? The error above was from running on OSX 10.14.6.

(FYI, I suspect it might have something to do with multiprocessing using spawn rather than fork by default on OSX as of Python 3.8 [doc link], but IDK.)
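
If that's the cause, forcing the fork start method before loading the pipeline might sidestep the pickling entirely, since forked children inherit the parent's memory instead of receiving a pickled copy. This is an untested sketch reusing the repro above, not a confirmed fix (and fork on macOS has its own caveats):

import multiprocessing

import spacy
from scispacy.linking import EntityLinker  # registers "scispacy_linker"

TEXTS = ["Hello! This is document 1.", "And here's doc 2."]

if __name__ == '__main__':
  # Restore the pre-3.8 OSX default: child processes inherit the nmslib
  # index via fork() instead of trying to pickle it for a spawned process.
  multiprocessing.set_start_method("fork")
  nlp = spacy.load("en_core_sci_sm")
  nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True,
                                          "linker_name": "rxnorm"})
  for doc in nlp.pipe(TEXTS, n_process=2):
    print(doc)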

dakinggg commented 3 years ago

Interesting, I'm not sure off the top of my head. Leaving this open for now; let me know if you happen to resolve anything. At a minimum, you could do the parallelization yourself (see the sketch below), but ideally it would work with spaCy's parallelization.
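
For the DIY route, a minimal sketch: each worker process loads its own copy of the pipeline in a Pool initializer, so the un-picklable nmslib index never has to cross a process boundary. The model and linker names are just the ones from the repro above, and only plain strings are returned to keep the results picklable:

import multiprocessing as mp

import spacy
from scispacy.linking import EntityLinker  # registers "scispacy_linker"

_nlp = None

def _init_worker():
  # Load the pipeline inside each worker so the nmslib index is
  # built per-process and never pickled.
  global _nlp
  _nlp = spacy.load("en_core_sci_sm")
  _nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True,
                                           "linker_name": "rxnorm"})

def _process(text):
  # Return entity strings rather than Doc objects, which keeps the
  # values sent back to the parent trivially picklable.
  return [ent.text for ent in _nlp(text).ents]

if __name__ == '__main__':
  texts = ["Hello! This is document 1.", "And here's doc 2."]
  with mp.Pool(processes=2, initializer=_init_worker) as pool:
    for ents in pool.imap(_process, texts):
      print(ents)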

kpich commented 3 years ago

I actually initially tried doing the parallelization myself with joblib, calling nlp() inside the parallelized function, and it gave me the same error as the spacy nlp.pipe snippet I posted, presumably because joblib likewise has to pickle the captured nlp object (and its nmslib index) to ship it to its workers.

Will let you know if I come across anything, but FWIW it seems to work fine on Linux.