chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

corpus.add_texts multithreading/processing broken with spaCy 2.0.x #204

Closed danielchalef closed 6 years ago

danielchalef commented 6 years ago

spaCy 2.0.x's pipe relies on numpy's linked BLAS library for parallelism and does not honor n_threads. As a result, passing n_threads to corpus.add_texts is ineffective. This can be worked around by setting an environment variable; however, MKL (distributed with all Anaconda Python installs) requires additional trickery in order to utilize all available cores.

Expected Behavior

Passing the n_threads parameter to corpus.add_texts should result in n_threads threads/processes being spawned to process the texts added to the corpus.

Current Behavior

The n_threads parameter is ignored, and the thread/process count is left up to the BLAS library.

Possible Solution

This is a workaround: define BLAS-specific environment variables that set the number of threads, e.g. OMP_NUM_THREADS=16 OPENBLAS_NUM_THREADS=16 MKL_NUM_THREADS=16
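A minimal shell sketch of the workaround, assuming a bash-like shell; the thread count of 16 is illustrative, and which variable takes effect depends on which BLAS build numpy is linked against:

```shell
# Set BLAS thread counts before launching Python.
export OMP_NUM_THREADS=16       # OpenMP-based BLAS builds
export OPENBLAS_NUM_THREADS=16  # OpenBLAS
export MKL_NUM_THREADS=16       # Intel MKL (Anaconda's default)
export MKL_DYNAMIC=FALSE        # stop MKL from dynamically reducing the count
# python my_script.py           # then launch the script in this environment
```

Exporting all three is harmless: each BLAS implementation only reads its own variable (plus OMP_NUM_THREADS, for OpenMP-based builds).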


Steps to Reproduce (for bugs)

corpus = textacy.Corpus("en_core_web_lg")

corpus.add_texts(texts=text, metadatas=metadata, n_threads=15)

Only 8 processes were spawned.

Context

Extremely slow parsing of a corpus of several million documents, despite running on a high CPU core machine.


itaibl commented 6 years ago

The solution doesn't seem to work. Where exactly should the BLAS-specific environment variables be defined? Thanks!

danielchalef commented 6 years ago

In the shell environment in which you're invoking your python script:

MKL_NUM_THREADS=16 MKL_DYNAMIC=FALSE python my_script.py

itaibl commented 6 years ago

Thanks Daniel. Is there a way of defining this in PyCharm? Using os.environ (os.environ["MKL_NUM_THREADS"] = "16") doesn't seem to work. Thanks!
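One likely reason os.environ appears ineffective is ordering: BLAS libraries typically read these variables when they are first loaded, so the assignments must happen before numpy (or anything that imports it, such as spacy or textacy) is imported. A minimal sketch, assuming an MKL-linked numpy and an illustrative thread count:

```python
import os

# Set the limits BEFORE any import that loads numpy/BLAS;
# once the library is initialized, these variables are ignored.
os.environ["MKL_NUM_THREADS"] = "16"
os.environ["MKL_DYNAMIC"] = "FALSE"

import numpy  # BLAS initializes here, picking up the limits above
```

If the script (or an IDE plugin) imports numpy earlier, the assignments come too late, which would match the behavior observed in PyCharm.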

danielchalef commented 6 years ago

Set the environment variables in the Run/Debug Configuration.