explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Race condition when using select_pipes in multiple threads #11904

Closed johnmccain closed 1 year ago

johnmccain commented 1 year ago

When using a single Language object across multiple threads and using select_pipes to enable & disable pipeline components, you can end up with a race condition where pipeline modifications from one thread overwrite those in another thread.

This race condition can be avoided by creating multiple instances of the Language object or by wrapping pipeline modifications in a lock, but those options require either holding multiple copies of the model in memory or incurring performance penalties due to the lock.
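
For reference, the lock-based workaround would look roughly like this (a minimal sketch; the lock and helper name are illustrative, and the lock has to cover both the pipe selection and the call to nlp, otherwise another thread can still swap pipes mid-call):

import threading
import spacy

nlp = spacy.load("en_core_web_sm")
pipeline_lock = threading.Lock()

def parse_with_lock(text, enable):
    # Serialize both the pipe selection and the processing call.
    with pipeline_lock:
        with nlp.select_pipes(enable=enable):
            return nlp(text)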

Is the select_pipes context manager expected to be thread safe? If not, should this be a feature request for a thread-safe method of modifying pipelines or running certain components of pipelines?

How to reproduce the behaviour

import threading
import time
import spacy

nlp = spacy.load("en_core_web_sm")

def thread1():
    for _ in range(100):
        with nlp.select_pipes(enable=["tokenizer"]):
            time.sleep(0.01)
            doc = nlp("This is a sentence.")

def thread2():
    for _ in range(100):
        with nlp.select_pipes(enable=["parser"]):
            time.sleep(0.01)
            doc = nlp("This is a sentence.")
            noun_chunks = list(doc.noun_chunks)

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start()
t2.start()
t1.join()
t2.join()

output:

Exception in thread Thread-2 (thread2):
Traceback (most recent call last):
  File "/Users/johnmccain/.pyenv/versions/3.10.4/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/Users/johnmccain/.pyenv/versions/3.10.4/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/johnmccain/Documents/Repos/spacy-select-pipe-race-condition/spacy-select-pipe-race-condition.py", line 18, in thread2
    noun_chunks = list(doc.noun_chunks)
  File "spacy/tokens/doc.pyx", line 875, in noun_chunks
  File "/Users/johnmccain/Documents/Repos/spacy-select-pipe-race-condition/venv/lib/python3.10/site-packages/spacy/lang/en/syntax_iterators.py", line 26, in noun_chunks
    raise ValueError(Errors.E029)
ValueError: [E029] `noun_chunks` requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation:
https://spacy.io/usage/models

Your Environment

polm commented 1 year ago

Thanks for making us aware of this and providing an example. We would not generally expect methods on the Language class like this to be thread-safe, so this isn't exactly surprising, though it is unfortunate. We can treat it as a feature request.

To help us understand the use case better, can you explain why you're doing things this way rather than using, say, multiple nlp objects?

johnmccain commented 1 year ago

Thanks for addressing this! I figured that this would likely end up being a feature request rather than a bug, so that sounds great to me.

The motivation for using a single nlp object is to avoid the additional memory usage that multiple objects would entail. The particular application this came up in uses en_core_web_lg, so the additional memory cost of loading multiple copies of the model is substantial. The motivation for using select_pipes is efficiency: avoiding wasted compute on components that aren't needed in certain scenarios.

Since submitting this ticket I came up with this solution, which appears to allow thread-safe pipeline selection:

import spacy

# Assumes a single shared `nlp` object loaded elsewhere, e.g.
# nlp = spacy.load("en_core_web_lg").

def run_with_pipes(
    text,
    enable: list | None = None,
    disable: list | None = None,
) -> spacy.tokens.Doc:
    if enable is None and disable is None:
        raise ValueError("One of `enable` or `disable` must be set.")
    elif enable is not None and disable is not None:
        raise ValueError("Only one of `enable` or `disable` can be set.")
    elif enable is None:
        # Translate `disable` into an allow-list of component names.
        enable = [name for name, _ in nlp.pipeline if name not in disable]
    # Tokenize without touching the shared pipeline state
    # (`_ensure_doc` is a private Language method).
    doc = nlp._ensure_doc(text)
    # Apply only the selected components to this doc.
    for name, component in nlp.pipeline:
        if name in enable:
            doc = component(doc)
    return doc
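
For illustration, a call like this would run only the components needed for noun chunks (component names assume the en_core_web_sm pipeline from the reproduction script; noun chunks need the dependency parse plus coarse POS tags):

doc = run_with_pipes(
    "This is a sentence.",
    enable=["tok2vec", "tagger", "parser", "attribute_ruler"],
)
print(list(doc.noun_chunks))

Because nlp.pipeline is only read here and never mutated, concurrent calls don't step on each other's pipeline selection the way select_pipes does.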

It would still be nice to have similar functionality available natively in spaCy.

adrianeboyd commented 1 year ago

Along those lines, nlp.pipe already supports disable, so for a single doc, you could use:

doc = next(nlp.pipe(["This is a text."], disable=["parser"]))

(Processing texts in batches with nlp.pipe is a lot more efficient, so we'd recommend batching your input texts and using nlp.pipe no matter what.)
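
For example, a batched version of that call (the texts, disable list, and batch size here are just illustrative):

texts = ["This is the first text.", "This is the second text."]
# Skip the parser for this batch; the other components still run.
for doc in nlp.pipe(texts, disable=["parser"], batch_size=64):
    print([ent.text for ent in doc.ents])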

johnmccain commented 1 year ago

Ah! I must have missed that when going through the docs, thank you. That seems like it would take care of this feature request then, so I will go ahead and close this ticket.

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.