Closed johnmccain closed 1 year ago
Thanks for making us aware of this and providing an example. We would not generally expect methods on the Language class like this to be thread-safe, so this isn't exactly surprising, though it is unfortunate. We can treat it as a feature request.
To help us understand the use case better, can you explain why you're doing things this way rather than using, say, multiple `nlp` objects?
Thanks for addressing this! I figured that this would likely end up being a feature request rather than a bug, so that sounds great to me.
The motivation for using a single `nlp` object is to avoid the additional memory usage that multiple objects would entail. The particular application that this came up in is using `en_core_web_lg`, so the additional memory usage for loading multiple copies of the model is substantial. The motivation for using `select_pipes` is efficiency, to avoid wasting compute on features that aren't needed in certain scenarios.
Since submitting this ticket I came up with this solution, which appears to allow thread-safe pipeline selection:
```python
import spacy

# Assumes a shared pipeline loaded once at module level,
# e.g. nlp = spacy.load("en_core_web_lg")

def run_with_pipes(
    text: str,
    enable: list[str] | None = None,
    disable: list[str] | None = None,
) -> spacy.tokens.Doc:
    """Run only the selected components, without mutating the shared pipeline."""
    if enable is None and disable is None:
        raise ValueError("One of `enable` or `disable` must be set.")
    elif enable is not None and disable is not None:
        raise ValueError("Only one of `enable` or `disable` can be set.")
    elif enable is None:
        enable = [name for name, _ in nlp.pipeline if name not in disable]
    # Note: _ensure_doc is a private spaCy method and may change between versions.
    doc = nlp._ensure_doc(text)
    for name, component in nlp.pipeline:
        if name in enable:
            doc = component(doc)
    return doc
```
It would still be nice to have similar functionality available natively in spaCy.
Along those lines, `nlp.pipe` already supports `disable`, so for a single doc, you could use:

```python
doc = next(nlp.pipe(["This is a text."], disable=["parser"]))
```

(Processing texts in batches with `nlp.pipe` is a lot more efficient, so we'd recommend batching your input texts and using `nlp.pipe` no matter what.)
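As a minimal sketch of that batching recommendation, assuming a blank English pipeline with a `sentencizer` in place of `en_core_web_lg` so it runs without a model download (the component choice here is illustrative, not from the thread):

```python
import spacy

# A small stand-in pipeline; a real application would use e.g. en_core_web_lg.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

texts = ["This is a text. It has two sentences.", "Another text."]

# Batch processing with the sentencizer disabled: no sentence boundaries are set.
docs_disabled = list(nlp.pipe(texts, disable=["sentencizer"]))

# The same batch with the full pipeline: sentence boundaries are set.
docs_enabled = list(nlp.pipe(texts))
```

Passing `disable` to `nlp.pipe` scopes the selection to that one call, so concurrent callers don't see each other's modifications the way they do with `select_pipes`.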
Ah! I must have missed that when going through the docs, thank you. That seems like it would take care of this feature request then, so I will go ahead and close this ticket.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
When using a single `Language` object across multiple threads and using `select_pipes` to enable & disable pipeline components, you can end up with a race condition where pipeline modifications from one thread overwrite those in another thread. This race condition can be avoided by creating multiple instances of the `Language` object or by wrapping pipeline modifications in a lock, but those options require either holding multiple copies of the model in memory or incurring performance penalties due to the lock.
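For illustration, the lock-based workaround mentioned above can be sketched with the standard library alone; the `FakePipeline` class below is a hypothetical stand-in for the shared `Language` object, not spaCy code:

```python
import threading

class FakePipeline:
    """Stand-in for a shared pipeline whose enabled components are mutable state."""
    def __init__(self):
        self.disabled = []

    def select(self, disable):
        # Mutates shared state, like select_pipes on a shared Language object.
        self.disabled = list(disable)

    def run(self, text):
        # Reads the mutable `disabled` list, so unsynchronized callers can race.
        return (text, tuple(self.disabled))

nlp = FakePipeline()
lock = threading.Lock()

def process(text, disable):
    # Serializing select + run avoids the race, at the cost of lock contention.
    with lock:
        nlp.select(disable)
        return nlp.run(text)

results = []
threads = [
    threading.Thread(target=lambda d=d: results.append(process("text", [d])))
    for d in ("parser", "ner")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the lock makes each select-then-run pair atomic, every result reflects exactly the components its own caller disabled.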
Is the `select_pipes` context manager expected to be thread-safe? If not, should this be a feature request for a thread-safe method of modifying pipelines or running certain components of pipelines?

How to reproduce the behaviour
output:
Your Environment