Closed: cordone closed this issue 3 years ago.
The Japanese models assign lemmas in the tokenizer rather than in the tagger, which is probably the source of the difference. I wouldn't have actually expected this to matter in the multiprocessing setup (since the token text is also added to the stringstore in the tokenizer step), but I can reproduce this and we'll look into it.
This is a problem related to `Doc.to/from_bytes` and `Doc.to/from_array`, which don't include any vocab strings. For most attributes this isn't a problem because they're provided by the model as a finite set (tag labels, dependency labels), but lemmas aren't. It's not a problem for the tokens because the text itself is also stored in `Doc.to_bytes`, and the `orth` value can be recovered from that.
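The failure can be sketched in plain Python, without spaCy (the names below are illustrative, not spaCy's actual API): attributes travel between processes as hashes, and a fresh process's string store can't map an unseen hash back to its string.

```python
# Minimal sketch: a "string store" maps hashes back to strings, and a
# serialized doc carries only the hashes, not the strings themselves.

worker_store = {}                  # the worker process's string store
lemma = "食べる"                   # lemma assigned by the Japanese tokenizer
h = hash(lemma)                    # stand-in for spaCy's string hashing
worker_store[h] = lemma

serialized = {"lemma_hash": h}     # what the Doc.to_bytes-style payload carries

main_store = {}                    # the receiving process's separate store
try:
    main_store[serialized["lemma_hash"]]
except KeyError:
    print("unresolvable hash -> the [E018] KeyError")
```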
Multiprocessing works for languages with a built-in lemmatizer because `assign_tag` adds the lemmas back to the string store in `Doc.from_array`. Then, when `Doc.from_array` re-assigns the lemma attribute, the string isn't missing.
The main underlying difference is that the English lemma strings are included in the model, so you can lemmatize again to recover the forms, but the Japanese lemmas aren't recoverable after the tokenizer step if you don't have the same string store.
Multiprocessing is going to break for any custom lemmatizer or component that assigns annotation that doesn't come from a (finite) set of provided labels, since the child processes have separate vocabs and string stores. It's really a more general problem for any kind of distributed computing. A related issue is #4411.
One workaround is to save the lemmas under a custom extension, like some of the other custom Japanese tokenizer data, but this isn't particularly satisfying, since they won't be available under `Token.lemma_`.
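A minimal sketch of that extension workaround (the extension name `lemma_str` is made up here, and the helper is called directly rather than registered as a pipeline component, to stay version-agnostic):

```python
import spacy
from spacy.tokens import Token

# Hypothetical workaround: copy each lemma string into a custom extension,
# so the raw string (not just its hash) rides along in the doc's user data,
# which Doc.to_bytes does serialize.
Token.set_extension("lemma_str", default=None)

def store_lemmas(doc):
    for token in doc:
        token._.lemma_str = token.lemma_
    return doc

nlp = spacy.blank("en")
doc = store_lemmas(nlp("An example sentence."))
print([t._.lemma_str for t in doc])
# In a real pipeline this would run as a component after the tokenizer.
```

The downside, as noted above, is that downstream code has to read `token._.lemma_str` instead of the standard `Token.lemma_`.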
Hmm, I guess you could also use `user_data` to save custom strings in some way? `Doc.from_bytes` could add any strings in, say, `user_data["strings"]`?
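Sketched with a toy store rather than spaCy's real serialization (the `user_data["strings"]` key is only the proposal above, not an existing API): ship the raw strings next to the hashed attributes, and re-intern them on deserialization.

```python
# Toy sketch of the user_data["strings"] idea: the serialized payload
# carries the strings alongside the hashes, and deserialization re-adds
# them to the local store before resolving any hashes.

def doc_to_bytes_like(store, lemma_hashes):
    return {
        "lemmas": lemma_hashes,
        "user_data": {"strings": [store[h] for h in lemma_hashes]},
    }

def doc_from_bytes_like(store, payload):
    for s in payload["user_data"]["strings"]:
        store[hash(s)] = s             # re-intern before resolving
    return [store[h] for h in payload["lemmas"]]

worker_store = {}
hashes = []
for s in ["食べる", "行く"]:
    h = hash(s)
    worker_store[h] = s
    hashes.append(h)

payload = doc_to_bytes_like(worker_store, hashes)

main_store = {}                        # fresh store, as in another process
print(doc_from_bytes_like(main_store, payload))  # ['食べる', '行く']
```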
Thanks for the details here @adrianeboyd!
I wanted to share something similar that I'm seeing when using the `EntityRuler` with a blank English model. Minimal example, using Python 3.8.5 and spaCy 2.3.2:
```python
import spacy
from spacy.pipeline import EntityRuler

if __name__ == '__main__':
    texts = [
        "I enjoy eating Pizza Hut pizza."
    ]
    patterns = [
        {"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
    ]
    nlp = spacy.blank("en")
    ruler = EntityRuler(nlp)
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)
    for doc in nlp.pipe(texts, n_process=2):
        for ent in doc.ents:
            print(f"{ent.text}, {ent.ent_id}, {ent.ent_id_}")
```
You'll observe `KeyError [E018]` when attempting to extract `ent_id_` from the matching entity. The issue goes away when you use a shorter `id` string for the pattern (e.g., `"id": "123"` will work).
Any thoughts on a workaround?
Thanks for the report, that does look like a related bug. I'll look into it!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
Hi, I'm trying to use the new `ja_core_news_sm` model to stream-process a collection of sentences with `nlp.pipe(list_of_sentences)`. I'd like to be able to set `n_process` > 1 to increase speed, but when I do that I encounter `KeyError [E018]`. I'm using WSL through VSCode. I'll try something like this...
and get this error.
The failing token wasn't the first one in the sentence, so I counted the number of tokens throwing exceptions: in one collection of sentences I have, 397/53597 iterated tokens cause an exception (so far the number of failures has stayed constant across re-runs varying `batch_size` and `n_process`).

Just to sanity-check: a bare `nlp.pipe()` or `[nlp(s) for s in sentences]` works with no issues. Possibly a model-specific issue?

Your Environment