Custom sentence segmentization

h0gar commented 1 year ago

Hi,

With spaCy, I would normally do this to use a custom sentencizer:

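from spacy.language import Language

# Mark every token except the last one as not starting a sentence
@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        token.is_sent_start = False
    return doc

# nlp is an already loaded spaCy pipeline
nlp.add_pipe("segm", first=True)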

But if I do that with spacy-stanza, I get the following error:

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

This happens even though "first=True" should make this component run before the document is parsed.

Is it possible to use a custom segmentation with spacy-stanza?

adrianeboyd commented 1 year ago

In spacy-stanza, the whole stanza processing runs as part of the tokenizer step, which runs before any pipeline components. That means the doc is already parsed by the time a component added with first=True sees it.

I think you can provide pretokenized, sentence-per-line text to stanza by passing additional options, as described here: https://github.com/explosion/spacy-stanza/#stanza-pipeline-options

All the options are passed through, so see if anything in their docs looks like what you want: https://stanfordnlp.github.io/stanza/tokenize.html
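
As a rough sketch of the pretokenized route (not tested against this exact setup): stanza's tokenize_pretokenized option treats whitespace as token boundaries and newlines as sentence boundaries, so a custom segmentation can be encoded in the input text itself. The helper names to_pretokenized_text, load_pretokenized_pipeline and print_sentences are made up for illustration, and the example assumes the English stanza models have already been downloaded.

```python
from typing import List


def to_pretokenized_text(sentences: List[List[str]]) -> str:
    """Join tokens with spaces and sentences with newlines.

    This is the layout stanza expects for string input when
    tokenize_pretokenized=True is set.
    """
    return "\n".join(" ".join(tokens) for tokens in sentences)


def load_pretokenized_pipeline(lang: str = "en"):
    """Load spacy-stanza with stanza's tokenizer and sentence splitter bypassed."""
    # Imported lazily so the helper above can be used without stanza installed.
    import spacy_stanza

    # Assumption: models were fetched beforehand, e.g. with stanza.download("en").
    return spacy_stanza.load_pipeline(lang, tokenize_pretokenized=True)


def print_sentences(sentences: List[List[str]]) -> None:
    """Run the pipeline on custom-segmented input and print each sentence's tokens."""
    nlp = load_pretokenized_pipeline()
    doc = nlp(to_pretokenized_text(sentences))
    for sent in doc.sents:
        print([token.text for token in sent])
```

Calling print_sentences([["This", "is", "one", "sentence", "."], ["Here", "is", "another", "."]]) should then yield sentence boundaries that follow the inner lists rather than stanza's own splitter. If only the sentence boundaries need to be custom and stanza should still tokenize, the stanza docs also describe tokenize_no_ssplit=True, where sentences are separated by blank lines instead.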

adrianeboyd commented 9 months ago

Please feel free to reopen if you're still running into issues!

explosion/spacy-stanza #88