Closed — TahaMunir1 closed this issue 4 years ago
So I managed to fix it using the Sentencizer. Sentence splitting wasn't working with the default implementation, and the Sentencizer wasn't working with the custom tokenizer I used. This is how sentence splitting with multiple whitespaces can be improved:
```python
import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang=a_lang_code)  # a_lang_code is a language code such as 'en'
nlp = StanzaLanguage(snlp)

sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

comm = 'This is a test message. Second. Third? Fourth! Fifth'
doc = nlp(comm)
for sent in doc.sents:
    print(sent)
```
I am trying to split text into sentences based on obvious punctuation marks ('.', '?', '!'), and I have been able to do this easily with the spaCy Sentencizer in the pipeline. But when I try to split with spacy-stanza instead, it works fine only until there are multiple spaces after a punctuation mark.
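The rule-based splitting described above can be sketched in plain Python. This is only an illustration of the idea (break after '.', '?' or '!' followed by whitespace), not spaCy's actual Sentencizer implementation; the `naive_split` helper is hypothetical.

```python
import re

def naive_split(text):
    # Break after '.', '?' or '!' when followed by one or more
    # whitespace characters; \s+ also absorbs runs of spaces.
    parts = re.split(r'(?<=[.?!])\s+', text.strip())
    return [p for p in parts if p]

print(naive_split('This is a test message. Second. Third? Fourth! Fifth'))
# → ['This is a test message.', 'Second.', 'Third?', 'Fourth!', 'Fifth']
```

Because the pattern uses `\s+`, multiple spaces after a punctuation mark do not produce empty segments.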
```python
import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)
doc = nlp('This is a test message. Second. Third? Fourth! Fifth')
```
I am getting this warning:
And this is the output:
How can I get the desired output? When I add the Sentencizer to the nlp pipeline, it raises an error, probably because the input it receives after snlp processing is not in the expected (parsed) format. And when I add it to the snlp processors, it makes no difference.
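Since the problem only appears with multiple spaces after punctuation, one possible workaround is to normalize whitespace before passing the text to the pipeline. This is a sketch of a hypothetical preprocessing step, not part of spacy-stanza itself:

```python
import re

def normalize_ws(text):
    # Collapse any run of whitespace (spaces, tabs, newlines)
    # into a single space and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()

comm = normalize_ws('This is a test message.  Second.   Third? Fourth! Fifth')
print(comm)
# → 'This is a test message. Second. Third? Fourth! Fifth'
```

The normalized string could then be fed to `nlp(comm)` so that the tokenizer never sees consecutive spaces.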