TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

sentence span is wrong if there are sentences containing only space tokens #42

Open jwijffels opened 2 years ago

jwijffels commented 2 years ago

The sentence spans (start_char and end_char) are wrong if the input contains sentences that consist only of space tokens.

>>> import spacy
>>> import spacy_udpipe
>>> spacy_udpipe.download("nl")
Already downloaded a model for the 'nl' language
>>> nlp = spacy_udpipe.load("nl")
>>>
>>> def line_splitter(x):
...     text = str(x)
...     text = text.split(sep = "\n")
...     text = [sent + "\n" for sent in text]
...     return text
...
>>> text_raw = "We gingen naar Brussel \n\n \nen kochten op 13/12/2021 veel eten. Jullie ook?"
>>> text = line_splitter(text_raw)
>>> text
['We gingen naar Brussel \n', '\n', ' \n', 'en kochten op 13/12/2021 veel eten. Jullie ook?\n']
>>> doc = nlp(text)
>>> for sent_i, sent in enumerate(doc.sents):
...     print(sent.start_char, sent.end_char)
...
0 22
23 70
>>> text_raw[0:(22+1)]
'We gingen naar Brussel '
>>> text_raw[23:(70+1)]
'\n\n \nen kochten op 13/12/2021 veel eten. Jullie o'
>>>
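
For comparison, the offsets one would expect can be computed from the raw string alone. This is a minimal sketch, assuming each non-blank line counts as one sentence and that offsets follow Python slicing (end exclusive); `expected_line_spans` is a hypothetical helper, not part of spaCy or spacy-udpipe.

```python
def expected_line_spans(raw):
    """Return (start, end) offsets of every non-blank line in raw (end exclusive)."""
    spans = []
    offset = 0  # position of the current line's first character in raw
    for line in raw.split("\n"):
        stripped = line.strip()
        if stripped:
            # Skip leading whitespace inside the line, drop trailing whitespace.
            start = offset + line.index(stripped)
            spans.append((start, start + len(stripped)))
        # +1 accounts for the newline that split() removed.
        offset += len(line) + 1
    return spans


text_raw = "We gingen naar Brussel \n\n \nen kochten op 13/12/2021 veel eten. Jullie ook?"
spans = expected_line_spans(text_raw)
for start, end in spans:
    print(start, end, repr(text_raw[start:end]))
```

Under those assumptions the last line starts at offset 27, whereas the span reported above starts at 23. The reported length (70 - 23 = 47) matches the length of that line, which suggests the span is shifted by the four whitespace-only characters rather than miscounted.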

jwijffels commented 2 years ago

The reason, of course, is that UDPipe does not return whitespace as tokens: all spaces are stored in the MISC column.
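
To illustrate how character offsets could be recovered from that column, here is a minimal sketch. It assumes the UDPipe tokenizer conventions for MISC (`SpacesBefore=`, `SpacesAfter=`, `SpaceAfter=No`, with whitespace escaped as `\s`, `\t`, `\r`, `\n`, `\p`, `\\`) and ignores `SpacesInToken` and multi-word tokens. The helper names and the sample rows are hypothetical; the rows are hand-written, not actual UDPipe output, and this is not how spacy-udpipe itself computes offsets.

```python
import re

# Escape sequences assumed for SpacesBefore / SpacesAfter values.
_UNESCAPE = {"\\s": " ", "\\t": "\t", "\\r": "\r", "\\n": "\n", "\\p": "|", "\\\\": "\\"}


def unescape_spaces(value):
    """Turn an escaped MISC value such as '\\s\\n' back into real whitespace."""
    return re.sub(r"\\[strnp\\]", lambda m: _UNESCAPE[m.group(0)], value)


def parse_misc(misc):
    """Parse a CoNLL-U MISC field ('A=1|B=2' or '_') into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def token_offsets(rows):
    """Given (form, misc) pairs, return [(form, start, end)] and the rebuilt text."""
    offsets, pieces, pos = [], [], 0
    for form, misc in rows:
        fields = parse_misc(misc)
        before = unescape_spaces(fields.get("SpacesBefore", ""))
        if "SpacesAfter" in fields:
            after = unescape_spaces(fields["SpacesAfter"])
        elif fields.get("SpaceAfter") == "No":
            after = ""
        else:
            after = " "  # CoNLL-U default: a single space follows the token
        pos += len(before)
        offsets.append((form, pos, pos + len(form)))
        pos += len(form) + len(after)
        pieces.append(before + form + after)
    return offsets, "".join(pieces)


# Hand-written example rows (hypothetical, for illustration only).
rows = [
    ("Brussel", r"SpacesAfter=\s\n\n\s\n"),
    ("en", "_"),
    ("kochten", "SpaceAfter=No"),
    (".", r"SpacesAfter=\n"),
]
offsets, text = token_offsets(rows)
for form, start, end in offsets:
    print(form, start, end)
```

The point of the sketch is that whitespace-only stretches still advance the running position even though they never appear as tokens, so offsets derived this way stay aligned with the raw text.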