explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Mixing stanfordNLP model with spacy NER not working #19

Closed OmriPi closed 4 years ago

OmriPi commented 5 years ago

Hi, I'm using spaCy to extract entities from documents. The NER component is very good; however, the sentence splitting on my documents (legal documents with long sentences) is quite poor, while StanfordNLP splits the sentences well. I wanted to use the StanfordNLP model together with spaCy's NER pipe to get the best of both worlds. However, I run almost exactly the code shown in the example of how to do this (except that the model is the large model and the text is the text of my document):

import stanfordnlp
import en_core_web_lg
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)
spacy_model = en_core_web_lg.load()
ner = spacy_model.get_pipe("ner")
nlp.add_pipe(ner)
doc = nlp(text)

When I then try to loop over the entities, I get:

../aten/src/ATen/native/LegacyDefinitions.cpp:14: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
../aten/src/ATen/native/LegacyDefinitions.cpp:14: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
../aten/src/ATen/native/LegacyDefinitions.cpp:14: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
Traceback (most recent call last):
  File "/Users/omri/PycharmProjects/NER/main.py", line 58, in <module>
    find_entities(text)
  File "/Users/omri/PycharmProjects/NER/main.py", line 19, in find_entities
    for ent in doc.ents:
  File "doc.pyx", line 512, in spacy.tokens.doc.Doc.ents.__get__
  File "span.pyx", line 118, in spacy.tokens.span.Span.__cinit__
ValueError: [E084] Error assigning label ID 9191306739292312949 to span: not in StringStore.

I think this points to a difference between the vocab of the spaCy model and that of the StanfordNLP model. How can this be fixed?

Thanks!

ines commented 5 years ago

I think this points to differences in the vocab of the spacy model and the StanfordNLP model.

Yes, seems like that's the case! What happens if you try the following and explicitly overwrite the vocab with the other model's vocab?

nlp = StanfordNLPLanguage(snlp)
nlp.vocab = spacy_model.vocab

the sentence splitting on my documents (legal type documents with long sentences) is quite horrible, while stanfordNLP splits the sentences quite well

If you're only using the model for sentence segmentation, you probably want to disable all the other components when you create the snlp object so it loads a bit faster! Also, just out of curiosity: have you tried spaCy's simpler, rule-based sentence segmentation? The default segmentation uses the dependency parser, which is usually more accurate – but it makes sense that the parser would be quite confused by legal text, which in turn makes the segmentation worse.
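
For reference, a rough sketch of the rule-based approach (assuming spaCy v2.x, where components are created via create_pipe):

import spacy

# Blank English pipeline: no parser, so sentence boundaries
# come from the rule-based sentencizer alone
nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("sentencizer"))

doc = nlp("This is a sentence. This is another one.")
for sent in doc.sents:
    print(sent.text)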

Also, speaking of legal text: If you haven't seen it yet, you might also want to check out @ICLRandD's blackstone project, a custom spaCy pipeline and model for legal text 🙂

OmriPi commented 5 years ago

Hi @ines, thanks a lot for the super quick response! I've tried your solution; however, I'm still getting the same outcome as before... Any other ideas? Also, how can I disable the other components like you said? The StanfordNLP pipe appears to be empty when I check it. I haven't tried the rule-based segmentation yet, since I hoped I could come up with a solution using StanfordNLP, which has already performed well in other cases, but I may have to resort to trying it.

ines commented 5 years ago

Ah, that's strange! Anyway, the underlying problem here seems to be the entity labels, which are not in the string store. So if you just add them manually, it should work:

for label in ner.labels:
    nlp.vocab.strings.add(label)
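
Putting it all together, the full flow would then look roughly like this (an untested sketch, assuming the same setup as above and that text holds your document):

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)

spacy_model = en_core_web_lg.load()
ner = spacy_model.get_pipe("ner")
nlp.add_pipe(ner)

# Make sure the NER label strings exist in this pipeline's StringStore
for label in ner.labels:
    nlp.vocab.strings.add(label)

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)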

Also, how can I disable the other components like you said? The pipe of the StanfordNLP appears to be empty when I check it.

Sorry if I phrased this in a confusing way. Because of how the StanfordNLP model structures the output, this wrapper sets all annotations in the tokenizer, so the spaCy pipeline is empty.

However, when you load the StanfordNLP model, you can explicitly specify the processors to load and use – e.g. tokenize if you only want to tokenize, and so on. See here: https://stanfordnlp.github.io/stanfordnlp/pipeline.html#basic-example
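
For example, if you only need tokenization and sentence splitting:

import stanfordnlp

# Load only the tokenize processor – it also performs sentence segmentation
snlp = stanfordnlp.Pipeline(lang="en", processors="tokenize")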

OmriPi commented 5 years ago

Ah, OK, thank you very much @ines! I'll check whether it works now (hopefully it will). Thanks for the clarification on the StanfordNLP pipe. Meanwhile, I tried the simple rule-based sentencizer, and it seems to perform much better than the dependency parser, so perhaps I'll stick with it; it seems good enough for what I'm trying to accomplish. Thanks again!