explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Sentence splitting is not working with multiple spaces after punctuation #31

Closed · TahaMunir1 closed this issue 4 years ago

TahaMunir1 commented 4 years ago

I am trying to split text into sentences based on obvious punctuation marks like '.', '?' and '!', and have been able to do so easily using the spaCy Sentencizer in the pipeline. Now when I try to use spacy-stanza to do the splitting, it works fine until there are multiple spaces after a punctuation mark.

import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)

# Note the extra spaces after 'Second.' and 'Fourth!'
doc = nlp('This is a test message. Second.  Third? Fourth!  Fifth')

I am getting this warning:

UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:

And this is the output:

['This is a test message.', 'Second. Third?', 'Fourth! Fifth']

How can I get the desired output? When I add the Sentencizer to the nlp pipeline, it raises an error, probably because the input it receives after snlp processing is not in the expected (already parsed) format. And when I add it to the processors of snlp, it makes no difference.
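
For comparison, the plain-spaCy Sentencizer setup mentioned at the start of this comment splits on the punctuation regardless of the extra whitespace. A minimal sketch, assuming spaCy v2 and a blank English pipeline (neither is shown above):

import spacy

# Blank English pipeline with only the rule-based Sentencizer,
# which splits on sentence-final punctuation such as '.', '?' and '!'.
nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("sentencizer"))

doc = nlp('This is a test message. Second.  Third? Fourth!  Fifth')
print([sent.text.strip() for sent in doc.sents])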

TahaMunir1 commented 4 years ago

So I managed to fix it using the Sentencizer. It wasn't working with the default implementation, and the Sentencizer wasn't working with the custom tokenizer I used. This is how sentence splitting with multiple whitespaces can be improved:

import stanza
from spacy_stanza import StanzaLanguage

# Build the Stanza pipeline and wrap it as a spaCy Language object
snlp = stanza.Pipeline(lang=a_lang_code)
nlp = StanzaLanguage(snlp)

# Add spaCy's rule-based Sentencizer on top of the Stanza tokenization
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

comm = 'This is a test message. Second.  Third? Fourth! Fifth'
doc = nlp(comm)

for sent in doc.sents:
    print(sent)
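
For newer versions: in spacy-stanza v1.0+ (built on spaCy v3), StanzaLanguage no longer exists and the pipeline is loaded with spacy_stanza.load_pipeline; components such as the sentencizer are then added by their registered name. A minimal sketch under those version assumptions:

import stanza
import spacy_stanza

stanza.download("en")                    # fetch the English Stanza models once
nlp = spacy_stanza.load_pipeline("en")   # spacy-stanza v1.x / spaCy v3 loading API
nlp.add_pipe("sentencizer")              # rule-based splitter, added by registered name

doc = nlp('This is a test message. Second.  Third? Fourth! Fifth')
for sent in doc.sents:
    print(sent.text)

Whether the extra sentencizer is still needed in those versions depends on how that Stanza release tokenizes the repeated whitespace; the snippet only shows the updated loading pattern.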