Open freddyheppell opened 11 months ago
Thanks for reporting this!
> Although I understand that sentence segmentation based on the dependency parser is probabilistic and not always correct, it seems there's some inconsistency between languages here...
Can you elaborate on the inconsistency between languages?
> ...and I don't think it would ever be correct for a whitespace token to be assigned as the start of a sentence.
While that is a reasonable take, bear in mind that spaCy's pretrained models (such as `it_core_news_xx`) are trained on corpora of natural language. `\n` is a control character and not something that appears in natural (Italian or otherwise) text, so the performance of trained models will not be great here.
I recommend removing such characters from your text or using the sentencizer component (and adjusting it to your use case, if necessary).
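As an illustration of the first suggestion, a minimal preprocessing step (plain Python, no spaCy dependency; the function name is just for illustration) could collapse newlines and other control whitespace into single spaces before the text reaches the pipeline:

```python
import re

def collapse_control_whitespace(text: str) -> str:
    """Replace runs of newlines, tabs, and other control whitespace
    with a single space, so the parser only sees natural text."""
    return re.sub(r"[\n\r\t\v\f]+", " ", text).strip()

# A newline embedded mid-text is flattened to a plain space.
cleaned = collapse_control_whitespace("Prima frase.\nSeconda frase.")
```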
> Can you elaborate on the inconsistency between languages?
I believe this behaviour occurs much more frequently in Italian than in other languages. Besides the examples in the notebook, where English correctly identifies 2 sentences but Italian produces 3, I'm working on a partially-parallel corpus in which Italian has a mean sents/doc noticeably higher than any other language (21 vs. 14-16), which makes me think it's an Italian-specific issue.
> I recommend removing such characters from your text or using the sentencizer component (and adjusting it to your use case, if necessary).
I was hoping to use the parser approach because the docs don't have ideal punctuation, but I tried the sentencizer with `\n` added to the chars list and it actually seems fine. It's also closed the gap a bit between sents/doc for Italian vs. the other languages (it's now 19 vs. 15-16), which I think further suggests there's some behaviour difference in the dependency-parser segmentation.
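The sentencizer is rule-based: roughly, it opens a new sentence after any character in its `punct_chars` list. A stdlib sketch of that idea, with `\n` included among the split characters (this is a deliberate simplification for illustration, not spaCy's actual implementation):

```python
def naive_sentencize(text, split_chars=(".", "!", "?", "\n")):
    """Rough sketch of rule-based segmentation: close the current
    sentence whenever a character from split_chars is seen."""
    sents, current = [], []
    for ch in text:
        current.append(ch)
        if ch in split_chars:
            sent = "".join(current).strip()
            if sent:  # a bare "\n" segment strips to nothing
                sents.append(sent)
            current = []
    tail = "".join(current).strip()
    if tail:
        sents.append(tail)
    return sents
```

With this rule a lone newline can never become a sentence of its own, because a whitespace-only segment is discarded, which mirrors why the sentencizer behaves more predictably here than the parser.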
It's possible something is going wrong with the whitespace augmentation, which is only supposed to attach whitespace to the preceding token and not create new sentences. We might look into this at a later point.
We're using this augmentation with the corpus - feel free to have a closer look and/or train your own model with modified settings:
[corpora.train.augmenter]
@augmenters = "spacy.combined_augmenter.v1"
lower_level = 0.1
whitespace_level = 0.1
whitespace_per_token = 0.05
whitespace_variants = "[\" \",\"\\t\",\"\\n\",\"\\u000b\",\"\\f\",\"\\r\",\"\\u001c\",\"\\u001d\",\"\\u001e\",\"\\u001f\",\" \",\"\\u0085\",\"\\u00a0\",\"\\u1680\",\"\\u2000\",\"\\u2001\",\"\\u2002\",\"\\u2003\",\"\\u2004\",\"\\u2005\",\"\\u2006\",\"\\u2007\",\"\\u2008\",\"\\u2009\",\"\\u200a\",\"\\u2028\",\"\\u2029\",\"\\u202f\",\"\\u205f\",\"\\u3000\"]"
orth_level = 0.0
orth_variants = null
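The `whitespace_variants` value is a JSON-encoded list embedded in a string. A quick stdlib check of what it contains (the list is abridged here to the first few entries; the real config continues with further Unicode space characters):

```python
import json

# First few entries of the whitespace_variants value from the
# augmenter config above (abridged for illustration).
whitespace_variants = '[" ", "\\t", "\\n", "\\u000b", "\\f", "\\r"]'

variants = json.loads(whitespace_variants)
# Every variant is a single whitespace character, including "\n",
# which is how augmented training data comes to contain newline tokens.
```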
How to reproduce the behaviour
Colab notebook demonstrating problem
When parsing a sentence that contains newlines, the Italian parser sometimes assigns the newline to a sentence by itself, for example:
Produces 3 sentences:
There are various experiments with different combinations of punctuation in the notebook.
Looking at the tokens and their `is_sent_start` property, it seems under some circumstances the `\n` and `I` tokens are both assigned as the start of a new sentence. I have not been able to cause this problem with `en_core_web_sm`, which always correctly identifies 2 sentences.

Although I understand that sentence segmentation based on the dependency parser is probabilistic and not always correct, it seems there's some inconsistency between languages here, and I don't think it would ever be correct for a whitespace token to be assigned as the start of a sentence.
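A check for the pathological case described above (a whitespace-only token flagged as a sentence start) can be expressed independently of the model; here the parsed tokens are mocked with a simple dataclass, since the point is the check itself rather than spaCy's API:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    is_sent_start: bool

def whitespace_sentence_starts(tokens):
    """Return whitespace-only tokens that are flagged as the
    start of a sentence (arguably never a correct analysis)."""
    return [t for t in tokens if t.is_sent_start and t.text.isspace()]

# Mocked tokens reproducing the reported pattern: "\n" and "I"
# both marked as sentence starts.
tokens = [
    Token("Prima", True), Token("frase", False), Token(".", False),
    Token("\n", True), Token("I", True), Token("agree", False),
]
bad = whitespace_sentence_starts(tokens)
```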
Your Environment