explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Stanza's sentencizer only works when `processors = 'tokenize,pos,lemma,depparse'` #57

Open namiyousef opened 3 years ago

namiyousef commented 3 years ago

Hi all,

I started an NLP project where I needed high-accuracy sentence segmentation, and therefore decided to use Stanza.

I was thrilled to find this library, since spaCy is quite intuitive. However, I found that the sentence segmentation only gets carried into spaCy under certain conditions.

Baseline:

The baseline test is to use the Stanza model alone to see if the sentence segmentation works.

This is the simplest model that I could use: I only turned on the tokenize processor.

Screenshot 2021-02-03 at 18 57 31

Test with spacy-stanza:

I then tried the same thing, but this time added the spacy-stanza wrapper.

Screenshot 2021-02-03 at 18 58 00

As shown above, the text was not actually split into sentences.

Test with spacy-stanza with more processors on Stanza:

Screenshot 2021-02-03 at 18 56 23

It seems that the depparse processor is necessary, but this is rather confusing since the vanilla Stanza model does not require it to segment sentences.

Any help would be appreciated :)

adrianeboyd commented 3 years ago

Yes, your analysis is correct. A typical spaCy pipeline sets the sentence boundaries from the dependency parse (the transition-based parser decides where to set the sentence breaks), so we've set up the wrapper here to work the same way, even though in Stanza the sentence boundaries come from tokenize and not depparse.

It would be a problem to have separate sentence boundaries that potentially conflict with the parses (the Doc can't store both), but here we know that they're consistent because they're only coming from one source in the pipeline.
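To make the potential conflict concrete, here is a minimal pure-Python sketch. It is not spaCy or spacy-stanza code: `sentence_ids`, `find_crossing_arcs`, and the list-based inputs are hypothetical and only for illustration. Given per-token sentence-start flags and per-token head indices, it reports dependency arcs that cross a sentence boundary, which is the kind of inconsistency that could arise if boundaries and parses came from different sources.

```python
from typing import List, Tuple


def sentence_ids(sent_starts: List[bool]) -> List[int]:
    """Map each token to the index of the sentence it belongs to."""
    ids = []
    current = -1
    for i, is_start in enumerate(sent_starts):
        # The first token always opens a sentence, whatever its flag says.
        if is_start or i == 0:
            current += 1
        ids.append(current)
    return ids


def find_crossing_arcs(sent_starts: List[bool], heads: List[int]) -> List[Tuple[int, int]]:
    """Return (token, head) pairs whose head lies in a different sentence.

    heads[i] is the absolute index of token i's head; a root points to itself.
    """
    if len(sent_starts) != len(heads):
        raise ValueError("sent_starts and heads must have the same length")
    ids = sentence_ids(sent_starts)
    return [(i, h) for i, h in enumerate(heads) if ids[i] != ids[h]]


# Tokens: "I" "left" "." "She" "stayed" "."
sent_starts = [True, False, False, True, False, False]
consistent_heads = [1, 1, 1, 4, 4, 4]   # every arc stays inside its sentence
conflicting_heads = [1, 1, 1, 1, 4, 4]  # "She" attaches to "left" across the break

print(find_crossing_arcs(sent_starts, consistent_heads))   # []
print(find_crossing_arcs(sent_starts, conflicting_heads))  # [(3, 1)]
```

When both the boundaries and the heads come from the same pipeline run, a check like this should come back empty, which is the consistency being relied on here.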

I'm not sure that there's much benefit to using stanza just for sentence segmentation (I'd be interested to hear about the use case where it's a lot better?) and I'm not sure we want to make this change to spacy-stanza v0.2.x at this point, but here's what it could look like:

https://github.com/adrianeboyd/spacy-stanza/tree/feature/sent-starts
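Independently of that branch, the general idea can be sketched in plain Python: if the tokenizer already returns tokens grouped by sentence, the sentence-start flags fall out of the grouping and no parser is needed. The helper name `flatten_with_sent_starts` and the nested-list input are assumptions for illustration; this is not the code in the branch.

```python
from typing import List, Tuple


def flatten_with_sent_starts(
    sentences: List[List[str]],
) -> Tuple[List[str], List[bool]]:
    """Flatten sentence-grouped tokens into one token list plus start flags.

    Each inner list holds the token texts of one sentence, as a tokenizer
    with built-in sentence segmentation might return them.
    """
    words: List[str] = []
    sent_starts: List[bool] = []
    for sentence in sentences:
        for position, word in enumerate(sentence):
            words.append(word)
            # Only the first token of each sentence is marked as a start.
            sent_starts.append(position == 0)
    return words, sent_starts


# Hypothetical tokenizer output: two sentences, already segmented.
segmented = [["Hello", "world", "."], ["This", "is", "Stanza", "."]]
words, sent_starts = flatten_with_sent_starts(segmented)

for word, is_start in zip(words, sent_starts):
    print(word, is_start)
```

In spaCy, flags of this kind correspond to `Token.is_sent_start`, so a wrapper could in principle set them from the tokenizer output instead of waiting for a parse.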