explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Dependency sentence segmenter handles newlines inconsistently between languages #13059

Open freddyheppell opened 11 months ago

freddyheppell commented 11 months ago

How to reproduce the behaviour

Colab notebook demonstrating problem

When parsing a sentence that contains newlines, the Italian parser sometimes assigns the newline to a sentence by itself, for example:

Ma regolamenta solo un settore, a differenza dell’azione a largo raggio dell’Inflation Act. \nI tentativi di legiferare per stimolare l’industria non hanno avuto molto successo.

Produces 3 sentences:

'Ma regolamenta solo un settore, a differenza dell’azione a largo raggio dell’Inflation Act (dalla sanità all’industria pesante).'
'\n'
'I tentativi di legiferare per stimolare l’industria non hanno avuto molto successo.'

There are various experiments with different combinations of punctuation in the notebook.

Looking at the tokens and their is_sent_start property, it seems that under some circumstances both the \n token and the following I token are marked as the start of a new sentence.

I have not been able to cause this problem with en_core_web_sm, which always correctly identifies 2 sentences.

Although I understand that sentence segmentation based on the dependency parser is probabilistic and not always correct, it seems there's some inconsistency between languages here, and I don't think it would ever be correct for a whitespace token to be assigned as the start of a sentence.
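The boundary flags can be inspected directly on the tokens. A minimal self-contained sketch (it uses a blank Italian pipeline with the rule-based sentencizer so it runs without a model download; the report itself concerns the parser-based it_core_news_sm pipeline, and the apostrophes here are plain ASCII):

```python
import spacy

# NOTE: blank pipeline + rule-based sentencizer, so this runs without
# `python -m spacy download it_core_news_sm`; the bug report is about
# the dependency parser in the pretrained Italian pipeline.
nlp = spacy.blank("it")
nlp.add_pipe("sentencizer")

text = (
    "Ma regolamenta solo un settore, a differenza dell'azione a largo "
    "raggio dell'Inflation Act. \nI tentativi di legiferare per "
    "stimolare l'industria non hanno avuto molto successo."
)
doc = nlp(text)

sentences = [s.text for s in doc.sents]
# Whitespace-only tokens flagged as sentence starts are the suspicious case.
whitespace_starts = [repr(t.text) for t in doc if t.is_sent_start and t.is_space]
print(len(sentences), whitespace_starts)
```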


rmitsch commented 11 months ago

Thanks for reporting this!

Although I understand that sentence segmentation based on the dependency parser is probabilistic and not always correct, it seems there's some inconsistency between languages here...

Can you elaborate on the inconsistency between languages?

...and I don't think it would ever be correct for a whitespace token to be assigned as the start of a sentence.

While that is a reasonable take, bear in mind that spaCy's pretrained models (such as it_core_news_xx) are trained on corpora of natural language. \n is a control character and not something that appears in natural (Italian or otherwise) text, so the performance of trained models will not be great here.

I recommend removing such characters from your text or using the sentencizer component (and adjusting it to your use case, if necessary).
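The first suggestion (removing such characters) can be a one-line preprocessing step; a sketch, where the function name is mine and not a spaCy API:

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse newlines and other whitespace runs into single spaces
    # before handing the text to the parser-based pipeline.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("dell'Inflation Act. \nI tentativi di legiferare"))
```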

freddyheppell commented 11 months ago

Can you elaborate on the inconsistency between languages?

I believe this behaviour occurs much more frequently in Italian than in other languages. As well as the examples in the notebook, where English identifies 2 sentences while Italian gets 3, I'm working on a partially-parallel corpus in which Italian has a noticeably higher mean sents/doc than any other language (21 vs 14-16), which makes me think it's an Italian-specific issue.

I recommend removing such characters from your text or using the sentencizer component (and adjusting it to your use case, if necessary).

I was hoping to use the parser approach because the docs don't have ideal punctuation, but I tried the sentencizer with \n added to the punctuation chars list and it actually works fine. It has also narrowed the sents/doc gap between Italian and the other languages (now 19 vs 15-16), which further suggests there's some behaviour difference in the parser-based sentence segmentation.
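For reference, the adjustment described above looks roughly like this: add \n to the sentencizer's punct_chars so a hard line break also closes a sentence (blank pipeline used here so the sketch is self-contained):

```python
import spacy
from spacy.pipeline import Sentencizer

# Extend the default punctuation list with the newline character.
punct_chars = Sentencizer.default_punct_chars + ["\n"]

nlp = spacy.blank("it")
nlp.add_pipe("sentencizer", config={"punct_chars": punct_chars})

doc = nlp(
    "Ma regolamenta solo un settore, a differenza dell'azione a largo "
    "raggio dell'Inflation Act. \nI tentativi di legiferare per "
    "stimolare l'industria non hanno avuto molto successo."
)
sents = [s.text for s in doc.sents]
# The newline now attaches to the first sentence instead of starting one.
print(len(sents))
```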

rmitsch commented 11 months ago

It's possible something is going wrong with the whitespace augmentation, which is only supposed to attach whitespace to the preceding token and not create new sentences. We might look into this at a later point.

We're using this augmentation with the corpus; feel free to take a closer look and/or train your own model with modified settings:

[corpora.train.augmenter]
@augmenters = "spacy.combined_augmenter.v1"
lower_level = 0.1
whitespace_level = 0.1
whitespace_per_token = 0.05
whitespace_variants = "[\" \",\"\\t\",\"\\n\",\"\\u000b\",\"\\f\",\"\\r\",\"\\u001c\",\"\\u001d\",\"\\u001e\",\"\\u001f\",\" \",\"\\u0085\",\"\\u00a0\",\"\\u1680\",\"\\u2000\",\"\\u2001\",\"\\u2002\",\"\\u2003\",\"\\u2004\",\"\\u2005\",\"\\u2006\",\"\\u2007\",\"\\u2008\",\"\\u2009\",\"\\u200a\",\"\\u2028\",\"\\u2029\",\"\\u202f\",\"\\u205f\",\"\\u3000\"]"
orth_level = 0.0
orth_variants = null
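To test whether the augmenter is responsible, spacy train accepts dot-notation config overrides on the command line, so the whitespace augmentation can be switched off without editing the config file (the config and output paths below are placeholders for your local training setup):

```shell
# Retrain with whitespace augmentation disabled; config.cfg and ./output
# are placeholders for your own training config and output directory.
python -m spacy train config.cfg \
  --output ./output \
  --corpora.train.augmenter.whitespace_level 0.0 \
  --corpora.train.augmenter.whitespace_per_token 0.0
```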