explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Sentence splitting is not working with multiple spaces after punctuation #31

Closed · TahaMunir1 closed this issue 4 years ago

TahaMunir1 commented 4 years ago

I am trying to split text into sentences based on obvious punctuation marks like '.', '?' and '!', and have been able to do so easily using the spaCy Sentencizer in the pipeline. Now when I try to use spacy-stanza to do the splitting, it works fine until there are multiple spaces after a punctuation mark.

import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)

# Note the extra spaces after 'Second.' and 'Fourth!'
doc = nlp('This is a test message. Second.  Third? Fourth!  Fifth')

I am getting this warning:

UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:

And this is the output:

['This is a test message.', 'Second. Third?', 'Fourth! Fifth']

How can I get the desired output? When I add the Sentencizer to the nlp pipeline, it raises an error, probably because the input it receives after snlp processing is not in the expected (already parsed) format. And when I add it to the processors of snlp, it makes no difference.
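
For comparison, the plain-spaCy Sentencizer setup mentioned at the start of this comment splits on the punctuation regardless of the extra whitespace. A minimal sketch, assuming spaCy v2 and a blank English pipeline (neither is shown above):

import spacy

# Blank English pipeline with only the rule-based Sentencizer,
# which splits on sentence-final punctuation such as '.', '?' and '!'.
nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("sentencizer"))

doc = nlp('This is a test message. Second.  Third? Fourth!  Fifth')
print([sent.text.strip() for sent in doc.sents])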

TahaMunir1 commented 4 years ago

So I managed to fix it using the Sentencizer. It wasn't working with the default implementation, and the Sentencizer wasn't working with the custom tokenizer I used. This is how sentence splitting with multiple whitespaces can be improved:

import stanza
from spacy_stanza import StanzaLanguage

# Build the Stanza pipeline and wrap it as a spaCy Language object
snlp = stanza.Pipeline(lang=a_lang_code)
nlp = StanzaLanguage(snlp)

# Add spaCy's rule-based Sentencizer on top of the Stanza tokenization
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

comm = 'This is a test message. Second.  Third? Fourth! Fifth'
doc = nlp(comm)

for sent in doc.sents:
    print(sent)
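
For newer versions: in spacy-stanza v1.0+ (built on spaCy v3), StanzaLanguage no longer exists and the pipeline is loaded with spacy_stanza.load_pipeline; components such as the sentencizer are then added by their registered name. A minimal sketch under those version assumptions:

import stanza
import spacy_stanza

stanza.download("en")                    # fetch the English Stanza models once
nlp = spacy_stanza.load_pipeline("en")   # spacy-stanza v1.x / spaCy v3 loading API
nlp.add_pipe("sentencizer")              # rule-based splitter, added by registered name

doc = nlp('This is a test message. Second.  Third? Fourth! Fifth')
for sent in doc.sents:
    print(sent.text)

Whether the extra sentencizer is still needed in those versions depends on how that Stanza release tokenizes the repeated whitespace; the snippet only shows the updated loading pattern.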