Closed · OmriPi closed this issue 4 years ago
> I think this points to differences in the vocab of the spaCy model and the StanfordNLP model.
Yes, seems like that's the case! What happens if you try the following and explicitly overwrite the vocab with the other model's vocab?
nlp = StanfordNLPLanguage(snlp)
nlp.vocab = spacy_model.vocab
> the sentence splitting on my documents (legal-type documents with long sentences) is quite horrible, while StanfordNLP splits the sentences quite well
If you're only using the model for sentence segmentation, you probably want to disable all the other components when you create the snlp
object to make it load a bit faster! Also, just out of curiosity: Have you also tried spaCy's simpler, rule-based sentence segmentation? The default segmentation uses the dependency parser which is usually more accurate – but it makes sense that the parser would be quite confused by legal text, which then in turn makes the segmentation worse.
Also, speaking of legal text: If you haven't seen it yet, you might also want to check out @ICLRandD's blackstone
project, a custom spaCy pipeline and model for legal text 🙂
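To make the rule-based suggestion above concrete, the idea behind a sentencizer can be sketched in plain Python. This is a toy illustration of punctuation-based splitting, not spaCy's actual `Sentencizer` implementation:

```python
# Toy rule-based sentence segmentation: start a new sentence after
# sentence-final punctuation. (Illustration only, not spaCy internals.)

PUNCT = {".", "!", "?"}

def split_sentences(tokens):
    """Group a flat list of tokens into sentences."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in PUNCT:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences

tokens = "The court ruled . The appeal failed .".split()
print(split_sentences(tokens))
# → [['The', 'court', 'ruled', '.'], ['The', 'appeal', 'failed', '.']]
```

A rule like this never consults a parse tree, which is why it can be more robust than the dependency-parser-based segmentation on out-of-domain text like legal documents.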
Hi @ines, thanks a lot for the super quick response! I've tried your solution, but I'm still getting the same outcome as before... Any other ideas? Also, how can I disable the other components like you said? The pipe of the StanfordNLP model appears to be empty when I check it. I haven't tried the rule-based segmentation yet, since I hoped I could come up with a solution for StanfordNLP, which already performed well in other cases, but I may have to resort to trying it.
Ah, that's strange! Anyway, the underlying problem here seems to be the entity labels, which are not in the string store. So if you just add them manually, it should work:
for label in ner.labels:
    nlp.vocab.strings.add(label)
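The reason this fixes the error: spaCy's `StringStore` maps strings to hash IDs, and annotations refer to strings by ID, so looking up a label that was never registered fails. A toy model of that behavior (not spaCy's real implementation):

```python
# Minimal sketch of why a missing label breaks entity iteration:
# annotations store hash IDs, and the store must know the string
# behind each ID. (Toy model, not spaCy's StringStore internals.)

class ToyStringStore:
    def __init__(self):
        self._by_id = {}

    def add(self, text):
        key = hash(text)          # spaCy uses a 64-bit hash internally
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        return self._by_id[key]   # raises KeyError for unseen IDs

store = ToyStringStore()
label_id = hash("ORG")

try:
    store[label_id]               # label never added -> lookup fails
except KeyError:
    print("label missing from store")

store.add("ORG")                  # the fix: register the label first
print(store[label_id])            # → ORG
```

The loop above does the same thing for every label the NER component can produce, so any entity label the model emits resolves cleanly.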
> Also, how can I disable the other components like you said? The pipe of the StanfordNLP appears to be empty when I check it.
Sorry if I phrased this in a confusing way. Because of how the StanfordNLP model structures the output, this wrapper sets all annotations in the tokenizer, so the spaCy pipeline is empty.
However, when you load the StanfordNLP model, you can explicitly specify the processors to load and use – e.g. tokenize
if you only want to tokenize, and so on. See here: https://stanfordnlp.github.io/stanfordnlp/pipeline.html#basic-example
Ah ok, thank you very much @ines! I will check whether it works now (hopefully it will). Thanks for the clarification on the StanfordNLP pipe. Meanwhile, I tried the simple rule-based sentencizer and it seems to perform much better than the dependency parser, so perhaps I'll stick with it; it seems good enough for what I'm trying to accomplish. Thanks again!
Hi, I'm using spaCy for extracting entities from documents. The NER component is very good; however, the sentence splitting on my documents (legal-type documents with long sentences) is quite horrible, while StanfordNLP splits the sentences quite well. I wanted to use the StanfordNLP model along with the NER pipe from spaCy to get the best of both worlds. However, when I run almost exactly the code shown in the example of how to do this (except that the model is the large model and the text is the text of my document) and try to loop over the entities, I'm getting an error.
I think this points to differences in the vocab of the spaCy model and the StanfordNLP model. I'm wondering: how can it be fixed?
Thanks!