**pege345** opened this issue 1 year ago
I can reproduce this, but it's probably related to torch rather than spaCy directly, and I'm not sure what might be going on in torch that would cause this. We'll take a look!
What we'd recommend instead as the first alternative to try is our built-in multiprocessing with `nlp.pipe`:
```python
import spacy
import torch

# Avoid a multiprocessing deadlock with torch (see #4667).
torch.set_num_threads(1)

nlp = spacy.load("en_core_web_trf")

texts = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible.",
]

for i in range(10):
    print(sum(len(doc.ents) for doc in nlp.pipe(texts, n_process=4)))
```
Notes:

- `torch.set_num_threads(1)` avoids a deadlock related to multiprocessing with torch (more details in #4667).
- Use `nlp.pipe(n_process=...)` for multiprocessing; you should process texts in batches with `nlp.pipe` for improved speed.
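As a small illustration of the batching point above, `nlp.pipe` also accepts a `batch_size` argument; the blank pipeline below is only a stand-in so the sketch runs without downloading `en_core_web_trf`:

```python
import spacy

# A blank English pipeline stands in for en_core_web_trf here,
# so this sketch runs without a model download.
nlp = spacy.blank("en")

texts = ["First document.", "Second document.", "Third document."]

# nlp.pipe streams the texts through the pipeline in batches;
# batch_size controls how many texts are buffered per batch.
docs = list(nlp.pipe(texts, batch_size=2))
print(len(docs))  # one Doc per input text
```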
When running data through the en_core_web_trf model concurrently, I get different results between runs. I cannot find anywhere in the documentation or other GitHub issues where this behaviour is explained.

The code below reproduces the behaviour. If I don't run data through the pipeline concurrently (e.g. setting max_workers=1), the result is always consistent.
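(The original repro code did not survive here. As a rough, hypothetical illustration only: the access pattern described — one shared, stateful object called from several threads via `max_workers` — looks something like the stdlib stub below, where `FakePipeline` is an invented stand-in for the loaded model, not spaCy's API.)

```python
from concurrent.futures import ThreadPoolExecutor

class FakePipeline:
    """Hypothetical stand-in for a loaded model: it keeps mutable
    internal state, so concurrent calls can interleave."""
    def __init__(self):
        self.calls = []

    def __call__(self, text):
        # Unsynchronized read-modify-write on shared state:
        # with several threads, the value read here can vary run to run.
        self.calls.append(text)
        return len(self.calls)

nlp_stub = FakePipeline()
texts = [f"doc {i}" for i in range(100)]

# max_workers=1 gives deterministic results; with more workers the
# threads' interleaving (and thus the returned values) can differ.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(nlp_stub, texts))

print(len(results))  # all 100 calls complete either way
```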
Your Environment

- Operating system: Amazon Linux 2 (kernel: Linux 4.14.294-220.533.amzn2.x86_64)
- Python version: 3.7.10
- spaCy version: 3.1.3
- Models: en-core-web-trf==3.1.0