explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Inconsistent NER predictions from identical inputs while using ThreadPoolExecutor #11868

Open pege345 opened 1 year ago

pege345 commented 1 year ago

When running data through the en_core_web_trf model concurrently, I get different results between runs. I cannot find this behaviour explained anywhere in the documentation or in other GitHub issues.

The code below reproduces the behaviour. If I don't run data through the pipeline concurrently (e.g. by setting max_workers=1), the results are always consistent.

import spacy
from concurrent.futures import ThreadPoolExecutor

nlp = spacy.load("en_core_web_trf")

def extract_entities(sentences):
    with ThreadPoolExecutor(max_workers=4) as e:
        submitted = [e.submit(call_spacy, sent) for sent in sentences]
        resolved = [item.result() for item in submitted]

        return resolved

def call_spacy(sent):
    result = nlp(sent)
    return result.ents

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

for i in range(10):
    result = extract_entities(input)
    print(sum([len(x) for x in result]))
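
The loop above prints only the total entity count per run, so it does not show which sentences change. A minimal sketch of a harness that compares each concurrent result against a serial baseline could look like the following. The names find_unstable_inputs and fake_entities are hypothetical, and fake_entities is a deterministic stand-in so that the block runs without the model.

```python
from concurrent.futures import ThreadPoolExecutor


def find_unstable_inputs(fn, items, runs=10, max_workers=4):
    """Map input index -> set of concurrent results that differ from the serial baseline."""
    # Serial baseline: one call at a time, which the report found to be consistent.
    baseline = [fn(item) for item in items]
    unstable = {}
    for _ in range(runs):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order, so results line up with `items`.
            concurrent = list(pool.map(fn, items))
        for i, (expected, got) in enumerate(zip(baseline, concurrent)):
            if got != expected:
                unstable.setdefault(i, set()).add(got)
    return unstable


def fake_entities(text):
    # Hypothetical stand-in for the pipeline: treats capitalized tokens as "entities".
    return tuple(tok for tok in text.split() if tok[:1].isupper())


if __name__ == "__main__":
    sample = [
        "Dexter Jettster owned a diner on Coruscant.",
        "laborers visited CoCo Town.",
    ]
    # An empty dict means every concurrent run matched the serial baseline.
    print(find_unstable_inputs(fake_entities, sample))
```

To use it with the real pipeline, fn would need to return something hashable and comparable, for example lambda s: tuple((e.text, e.label_) for e in nlp(s).ents), rather than the entity spans themselves.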

adrianeboyd commented 1 year ago

I can reproduce this, but it's probably related to torch rather than to spaCy directly, and I'm not sure what in torch would cause it. We'll take a look!
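
If the thread pool has to stay in the meantime, one possible stopgap is to serialize calls into the shared pipeline, given that the single-worker runs reported above were consistent. The following is an untested-against-this-issue sketch, not a confirmed fix. The names serialized and ConcurrencyProbe are hypothetical, and ConcurrencyProbe is only a stand-in for the model that records how many calls overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def serialized(fn):
    """Wrap fn so that only one thread at a time can be inside it."""
    lock = threading.Lock()

    def wrapper(*args, **kwargs):
        with lock:
            return fn(*args, **kwargs)

    return wrapper


class ConcurrencyProbe:
    """Hypothetical stand-in for a shared model; records the peak number of overlapping calls."""

    def __init__(self):
        self._guard = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, text):
        with self._guard:
            self.active += 1
            self.peak = max(self.peak, self.active)
        result = text.upper()  # placeholder for the real work
        with self._guard:
            self.active -= 1
        return result


if __name__ == "__main__":
    probe = ConcurrencyProbe()
    safe_call = serialized(probe)
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(safe_call, ["a", "b", "c", "d"]))
    # With the wrapper in place, calls never overlap, so the peak stays at 1.
    print(results, probe.peak)
```

In the reproduction script, the equivalent would be wrapping call_spacy with serialized. This gives up any parallelism inside the pipeline call itself, so the thread pool then only helps with work done outside that call.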

As a first alternative, we'd recommend trying our built-in multiprocessing with nlp.pipe:

import spacy
import torch

torch.set_num_threads(1)

nlp = spacy.load("en_core_web_trf")

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

for i in range(10):
    print(sum(len(doc.ents) for doc in nlp.pipe(input, n_process=4)))

Notes: