explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

KeyError [E018] when using nlp.pipe with n_process > 1 #5643

Closed (cordone closed this issue 3 years ago)

cordone commented 4 years ago

How to reproduce the behaviour

Hi, I'm trying to use the new ja_core_news_sm model to stream-process a collection of sentences with nlp.pipe(list_of_sentences). I'd like to be able to set n_process > 1 to increase speed, but when I do, I encounter KeyError [E018]. I'm using WSL through VSCode.

I run something like this...


import collections
import spacy

Word = collections.namedtuple("Word", ["surface", "lemma", "upos", "xpos", "dep"])

nlp = spacy.load("ja_core_news_sm", disable=["ner", "entity_linker"])

# sentences is a list of Japanese sentence strings
for doc in nlp.pipe(sentences, batch_size=150, n_process=2):
    for token in doc:
        word = Word(surface=token.text, lemma=token.lemma_, upos=token.pos_, xpos=token.tag_, dep=token.dep_)

and get this error.

word = Word(surface=token.text, lemma=token.lemma_, upos=token.pos_, xpos=token.tag_, dep=token.dep_)
  File "token.pyx", line 894, in spacy.tokens.token.Token.lemma_.__get__
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
 KeyError: "[E018] Can't retrieve string for hash '17260935250788936050'. This usually refers to an issue with the `Vocab` or `StringStore`."

The failing token wasn't the first one in the sentence, so I counted the tokens that throw exceptions: in one collection of sentences I have, 397 of 53,597 iterated tokens raise an exception (so far the number of failures has stayed constant across re-runs while varying batch_size and n_process).

Just as a sanity check, a bare nlp.pipe() and [nlp(s) for s in sentences] both work with no issues. Possibly a model-specific issue?


adrianeboyd commented 4 years ago

The Japanese models assign lemmas in the tokenizer rather than in the tagger, which is probably the source of the difference. I wouldn't have actually expected this to matter in the multiprocessing setup (since the token text is also added to the stringstore in the tokenizer step), but I can reproduce this and we'll look into it.

adrianeboyd commented 4 years ago

This is a problem related to Doc.to/from_bytes and Doc.to/from_array, which don't include any vocab strings. For most attributes this isn't a problem because they're provided by the model as a finite set (tag labels, dependency labels), but lemmas aren't. It's not a problem for the tokens themselves because the text is also stored in Doc.to_bytes, so the orth value can be recovered from it.
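Here's a minimal sketch of that mechanism (the blank English pipeline and hand-set lemma are made up for illustration): serializing a Doc and loading it into a fresh Vocab, which is roughly what the multiprocessing round trip does, keeps the token text but loses any string that only ever existed in the original StringStore.

import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.blank("en")
doc = nlp("dogs")
doc[0].lemma_ = "dog"  # "dog" now exists only in nlp.vocab.strings

data = doc.to_bytes()
doc2 = Doc(Vocab()).from_bytes(data)  # fresh Vocab, like a separate process

print(doc2[0].text)  # "dogs": the text is stored in the bytes, so the orth string is recovered
doc2[0].lemma_       # KeyError [E018]: the lemma hash is restored, but its string is not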

Multiprocessing works for languages with a built-in lemmatizer because assign_tag adds the lemmas back to the string store in Doc.from_array, so when Doc.from_array re-assigns the lemma attribute, the string isn't missing.

The main underlying difference is that the English lemma strings are included in the model, so you can lemmatize again to recover the forms, but the Japanese lemmas aren't recoverable after the tokenizer step if you don't have the same string store.

Multiprocessing is going to break for any custom lemmatizer or component that assigns annotation that doesn't come from a (finite) set of provided labels, since the child processes have separate vocabs and string stores. It's really a more general problem for any kind of distributed computing. A related issue is #4411.

One workaround is to save the lemmas under a custom extension like some of the other custom Japanese tokenizer data, but this isn't particularly satisfying, since they won't be available under Token.lemma_.
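A rough sketch of that extension-based workaround (the component and extension names here are made up): copy the lemma string onto a Token extension inside the worker process, where the hash still resolves. Extension values live in doc.user_data, which does survive the byte serialization back to the main process.

import spacy
from spacy.tokens import Token

nlp = spacy.load("ja_core_news_sm", disable=["ner", "entity_linker"])
Token.set_extension("lemma_str", default=None)

def copy_lemmas(doc):
    # Runs in the worker process, where token.lemma_ still resolves.
    for token in doc:
        token._.lemma_str = token.lemma_
    return doc

nlp.add_pipe(copy_lemmas, last=True)  # spaCy 2.x-style function component

sentences = ["これはテストの文です。", "吾輩は猫である。"]
for doc in nlp.pipe(sentences, batch_size=150, n_process=2):
    for token in doc:
        print(token.text, token._.lemma_str)  # intact, even where token.lemma_ raises E018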

Hmm, I guess you could also use user_data to save custom strings in some way? Doc.from_bytes could add any strings in, say, user_data["strings"]?
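Something like this, maybe (again just a sketch with a made-up component name; the manual re-registration in the main loop is what a future Doc.from_bytes could do automatically):

import spacy

nlp = spacy.load("ja_core_news_sm", disable=["ner", "entity_linker"])

def store_strings(doc):
    # Runs in the worker: record the lemma strings while their hashes still resolve.
    doc.user_data["strings"] = [token.lemma_ for token in doc]
    return doc

nlp.add_pipe(store_strings, last=True)

sentences = ["これはテストの文です。", "吾輩は猫である。"]
for doc in nlp.pipe(sentences, batch_size=150, n_process=2):
    # Back in the main process: re-register the strings so the hash lookups succeed.
    for s in doc.user_data.get("strings", []):
        nlp.vocab.strings.add(s)
    for token in doc:
        print(token.text, token.lemma_)  # no E018 once the strings are re-added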

macarlin commented 3 years ago

Thanks for the details here @adrianeboyd!

I wanted to share something similar that I'm seeing when using the EntityRuler with a blank English model. Minimal example, using Python 3.8.5 and spaCy 2.3.2:

import spacy
from spacy.pipeline import EntityRuler

if __name__ == '__main__':
    texts = [
        "I enjoy eating Pizza Hut pizza."
    ]

    patterns = [
        {"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
    ]

    nlp = spacy.blank("en")
    ruler = EntityRuler(nlp)
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    for doc in nlp.pipe(texts, n_process=2):
        for ent in doc.ents:
            print(f"{ent.text}, {ent.ent_id}, {ent.ent_id_}")

You'll observe KeyError [E018] when attempting to extract ent_id_ from the matching entity. The issue goes away when you use a shorter id string for the pattern (e.g., "id": "123" will work).

Any thoughts on a workaround?
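Going by the explanation above, one untested guess at a stopgap (not confirmed here) would be to re-register the pattern id strings in the main process's StringStore, so the hash stored on the entity can be resolved after the docs come back from the workers:

# Hypothetical stopgap: make sure the id strings exist in the main process's vocab.
for p in patterns:
    nlp.vocab.strings.add(p["id"])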

adrianeboyd commented 3 years ago

Thanks for the report, that does look like a related bug. I'll look into it!

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.