Open namiyousef opened 3 years ago
Yes, your analysis is correct. A typical spaCy pipeline sets the sentence boundaries from the dependency parse (the transition-based parser decides where the sentence breaks go), so we've set up the wrapper here to work the same way, even though the sentence boundaries come from `tokenize` and not `depparse`.
It would be a problem to have separate sentence boundaries that potentially conflict with the parses (the `Doc` can't store both), but here we know that they're consistent because they're only coming from one source in the pipeline.
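To make "conflict" concrete: one way to read it is a dependency arc whose head sits in a different sentence than the token itself. The check below is a plain-Python sketch of that reading only. `sentence_ids` and `has_conflict` are hypothetical helpers, not part of spaCy or spacy-stanza, and the heads and flags are made-up toy data.

```python
from typing import List


def sentence_ids(sent_starts: List[bool]) -> List[int]:
    """Map each token to the index of the sentence it belongs to."""
    ids = []
    current = -1
    for i, starts in enumerate(sent_starts):
        # The first token always opens a sentence, whatever its flag says.
        if starts or i == 0:
            current += 1
        ids.append(current)
    return ids


def has_conflict(heads: List[int], sent_starts: List[bool]) -> bool:
    """Return True if any token's head lies in a different sentence."""
    ids = sentence_ids(sent_starts)
    return any(ids[i] != ids[h] for i, h in enumerate(heads))


# Toy data for "Hello world . Bye now ." with absolute head indices
# (a root token points to itself).
heads = [0, 0, 0, 3, 3, 3]
consistent = [True, False, False, True, False, False]
shifted = [True, False, False, False, True, False]  # boundary one token late

print(has_conflict(heads, consistent))  # False
print(has_conflict(heads, shifted))     # True
```

When both the heads and the boundaries come from the same stanza run, this kind of mismatch should not arise, which is the point made above.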
I'm not sure that there's much benefit to using `stanza` just for sentence segmentation (I'd be interested to hear about the use case where it's a lot better?) and I'm not sure we want to make this change to `spacy-stanza` v0.2.x at this point, but here's what it could look like:
https://github.com/adrianeboyd/spacy-stanza/tree/feature/sent-starts
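For readers who don't want to open the branch, the general idea can be sketched roughly as follows. This is not the code in the branch: `flatten_with_sent_starts` and `to_spacy_doc` are hypothetical names, the input is assumed to be a list of already-tokenized sentences, and the `sent_starts` argument assumes the spaCy v3 `Doc` constructor.

```python
from typing import List, Tuple


def flatten_with_sent_starts(
    sentences: List[List[str]],
) -> Tuple[List[str], List[bool]]:
    """Flatten tokenized sentences and flag the first token of each one."""
    words: List[str] = []
    sent_starts: List[bool] = []
    for sentence in sentences:
        for i, word in enumerate(sentence):
            words.append(word)
            sent_starts.append(i == 0)
    return words, sent_starts


def to_spacy_doc(sentences: List[List[str]]):
    """Build a spaCy Doc whose sentence boundaries come from the input.

    Assumes spaCy v3, where Doc accepts a `sent_starts` argument.
    spaCy is imported lazily so the helper above works without it.
    """
    import spacy
    from spacy.tokens import Doc

    words, sent_starts = flatten_with_sent_starts(sentences)
    nlp = spacy.blank("en")
    return Doc(nlp.vocab, words=words, sent_starts=sent_starts)


# Toy input standing in for the output of a tokenize-only stanza pipeline.
toy = [["Hello", "world", "."], ["Bye", "now", "."]]
words, flags = flatten_with_sent_starts(toy)
print(words)
print(flags)
```

With spaCy v3 installed, `list(to_spacy_doc(toy).sents)` would then be expected to yield two spans, without any parser in the pipeline.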
Hi all,
I started an NLP project where I needed high-accuracy sentence segmentation, and therefore decided to use stanza.
I was thrilled to find this library, since spaCy is quite intuitive. However, I found that the sentence segmentation only gets carried over into spaCy under certain conditions.
Baseline:
The baseline test is to use the Stanza model alone to see if the sentence segmentation works.
This is the simplest model that I could use: I simply turned on the `tokenize` processor.
Test with spacy-stanza:
I then tried the same thing, but this time added the spacy-stanza wrapper.
As shown above, the sentences were not actually tokenized.
Test with spacy-stanza with more processors on Stanza:
It seems that the `depparse` processor is necessary, but this is rather confusing since the vanilla stanza model does not require it to split the text into sentences.
Any help would be appreciated :)