fnl / syntok

Text tokenization and sentence segmentation (segtok v2)
MIT License

German ordinal numbers lead to over splitting #28

Open nickluger opened 2 years ago

nickluger commented 2 years ago

This is a German text containing ordinal numbers. (The original text passed to syntok does not contain \n; the line breaks are added here only for readability.)

Ich habe am 3. Juni Geburtstag. 
Jonas ist Fan vom 1. FC Köln, und du? 
Meine Eltern haben 6 Kinder. 
Dies ist nun der 17. Versuch. 
Friedrich II. war der Sohn von Heinrich VI

is split into the following parts:

"Ich habe am 3.",
"Juni Geburtstag.",
"Jonas ist Fan vom 1.",
"FC Köln, und du?",
"Meine Eltern haben 6 Kinder.",
"Dies ist nun der 17.",
"Versuch.",
"Friedrich II. war der Sohn von Heinrich VI",

I understand that this is very difficult to get right in German, where an uppercase word can follow the ordinal number.

Dates like 3. Juni might be detectable, though. Interestingly, unlike all the other examples, the last part is not split at Friedrich II.

Besides that, syntok seems to be a sublime sentence splitter for German, thank you for this. 🙏

fnl commented 2 years ago

Hi Nick, thanks for the kind words, and glad you find syntok useful.

Yes, agreed: for the German month-based date cases, setting up a few rules should be easy. I will try to do that asap.
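A minimal sketch of what such a month rule could look like, written as a post-processing step over already-split segments. This is not syntok's code or API: the function name `merge_german_dates`, the month list, and the two regexes are assumptions made for illustration only.

```python
import re

# German month names; abbreviations such as "Jan." are deliberately left out.
GERMAN_MONTHS = (
    "Januar", "Februar", "März", "April", "Mai", "Juni",
    "Juli", "August", "September", "Oktober", "November", "Dezember",
)

# A segment that ends in a one- or two-digit number (at most 31) plus a dot.
ENDS_WITH_DAY = re.compile(r"(?:^|\s)(?:[12]?[0-9]|3[01])\.$")
# A segment that starts with a German month name as a whole word.
STARTS_WITH_MONTH = re.compile(r"(?:%s)\b" % "|".join(GERMAN_MONTHS))


def merge_german_dates(segments):
    """Hypothetical helper: re-join "3." + "Juni ..." into one segment."""
    merged = []
    for segment in segments:
        if (
            merged
            and ENDS_WITH_DAY.search(merged[-1])
            and STARTS_WITH_MONTH.match(segment)
        ):
            merged[-1] = merged[-1] + " " + segment
        else:
            merged.append(segment)
    return merged


parts = ["Ich habe am 3.", "Juni Geburtstag.", "Dies ist nun der 17.", "Versuch."]
print(merge_german_dates(parts))
```

Only the date is repaired here; "17." followed by "Versuch." stays split, because "Versuch" is not a month name. The rule would also wrongly join a sentence that really ends in a number when the next one happens to start with a month name.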

For other ordinals, if we find enough hard data to support a sensible rule (like a number followed by a terminal and a sequence of upper-case letters, e.g. "1. FC"), such rules could be added, in theory. But for that, I would like to see some statistics showing that we are not hurting performance overall, especially for languages other than German.

You have probably figured this out by now, but the last example you showed does not over-split simply because the terminal is followed by a lower-case letter. However, in German, that particular use of ordinals typically only applies to the ordinals used in the names of noble people. Normally, an ordinal is followed by a (proper) noun, not preceded by one, in which case the following letter is upper case, leading to the bad outcome ("17. Versuch") :(
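The effect can be illustrated with a deliberately naive splitter that breaks at a terminal followed by whitespace and an upper-case letter. This is only a sketch of the heuristic described above, not syntok's actual segmentation logic; `NAIVE_BOUNDARY` and `naive_split` are made-up names.

```python
import re

# Split where a terminal (. ! ?) is followed by whitespace and an upper-case
# letter. The lookarounds keep the terminal and the letter in the output.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")


def naive_split(text):
    """Toy splitter used only to illustrate the upper-case heuristic."""
    return NAIVE_BOUNDARY.split(text)


text = (
    "Ich habe am 3. Juni Geburtstag. "
    "Jonas ist Fan vom 1. FC Köln, und du? "
    "Meine Eltern haben 6 Kinder. "
    "Dies ist nun der 17. Versuch. "
    "Friedrich II. war der Sohn von Heinrich VI"
)
for part in naive_split(text):
    print(repr(part))
```

On the text from the original report, this toy rule yields the same eight parts: it splits after "3.", "1." and "17." because an upper-case letter follows, and it leaves "Friedrich II. war" intact because "war" is lower case.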

nickluger commented 2 years ago

Hey Florian,

thank you for your comprehensive explanation.

Agreed on performance. I also tried an ML-based tool for this; it got one or two of these cases right, but was much slower. Yes, dates would be low-hanging fruit and would cover many cases. Also, two consecutive uppercase letters mostly indicate a proper noun in all targeted languages, but it depends on which case occurs more often. In German it is rather difficult to construct a (false positive) example where one sentence ends with a number and the next one starts with a proper noun.

"Das ist nun Sieg Nr 3. FC-Köln-Fans sind außer sich." is possible, but sounds a bit made-up 😄

For our case it's not super important to get everything 100% right, as we're feeding an ML tool with raw masses of sentences anyway, and a tiny number of wrongly split sentences will not cause us any headaches.

Thanks!

fnl commented 2 years ago

I queried the English Wikipedia with the following regex: /[^0-9A-Za-z][0-9]\. [A-Z][A-Z]+/. That immediately surfaces the following cases, which are indeed proper sentence-terminal uses of this pattern and should be split:

  1. "... channel 7. KBS also ..."
  2. "... and HIV-2. HIV-1 is the virus ..."
  3. "... with SSH-1. SSH-2 features ..."
  4. "... higher than 2. CAP of depth 2 ..."

Overall in English, with this pattern, I can only find the "1. FC" case that should not be split, but more importantly, a number of cases that should be split. Then I tried this pattern on the German Wikipedia, and found the following ordinal expressions that should not be split:

  1. "1. PD"
  2. "1. FFC"
  3. "7. US Armee"
  4. "4. ZK-Plenum"
  5. "4. ATP"
  6. "1. FDJ"
  7. etc. etc.
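To make the ambiguity concrete, here is a small sketch that runs the regex quoted above over the snippets from both lists. The helper `hits` and the list names are made up for illustration; the leading space it adds only stands in for the left context that the bare German expressions lack.

```python
import re

# The pattern quoted above, unchanged.
PATTERN = re.compile(r"[^0-9A-Za-z][0-9]\. [A-Z][A-Z]+")

# English snippets where the dot really ends a sentence (should split).
ENGLISH_SPLIT = [
    "... channel 7. KBS also ...",
    "... and HIV-2. HIV-1 is the virus ...",
    "... with SSH-1. SSH-2 features ...",
    "... higher than 2. CAP of depth 2 ...",
]

# German ordinal expressions where the dot is part of the ordinal (no split).
GERMAN_NO_SPLIT = ["1. PD", "1. FFC", "7. US Armee", "4. ZK-Plenum", "4. ATP", "1. FDJ"]


def hits(snippet):
    """True if the pattern occurs in the snippet (hypothetical helper)."""
    return PATTERN.search(" " + snippet) is not None


for snippet in ENGLISH_SPLIT + GERMAN_NO_SPLIT:
    print(hits(snippet), snippet)
```

Every snippet in both lists matches, so the pattern by itself cannot tell a sentence end from an ordinal. Note also that, as written, the pattern only covers single-digit numbers followed by at least two upper-case letters, so it matches neither "17. Versuch" nor "3. Juni".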

Therefore, it might be worth elevating the specific expression "1. FC" to a special no-split rule, as well as handling the "day of the month, dot, name of month" case for German month names. Preventing segmentation on the general number-dot-uppercase pattern, however, is likely to lead to false negatives, even though in German (only?!) this pattern is pretty much always a no-split.

Any other thoughts or ideas? Any good ways to prove a different viewpoint?

fnl commented 2 years ago

Maybe one would have to add a simple language detection algorithm to properly solve this case while still being open to any language that uses the Latin alphabet?

nickluger commented 2 years ago

Cool, didn't know one could regex search Wikipedia. The first sentences could easily appear in German too, though.

Therefore, I think the suggested rule (no split when two uppercase letters follow) would cause false negatives in any language that allows a sentence to start with a subject that is a proper noun without an article.

The month names + dot pattern, though, appears quite often in most Latin-alphabet languages and deserves special treatment. I have to admit, I'm not proficient enough in Python (currently) to write a PR myself.

fnl commented 2 years ago

No worries, I can do those changes. Only, I have my plate quite full right now, with multiple issues pending. So it might take a week or two until I have a fix for this out there. Hope that's not a problem for you!

nickluger commented 2 years ago

Of course not, I'm grateful this library exists at all!

zerogerc commented 1 year ago

Hi, just want to mention that I've also stumbled upon that issue.

I have a concern about a "simple language detection" algorithm, as detecting a language can be quite tricky; e.g., the langdetect library doesn't work properly on short sentences. Moreover, there are cases where one language is embedded in another.

I would prefer to pass a language as a parameter into sentence segmentation as I already know the language of the sentences I want to split.
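A rough sketch of what a caller-supplied language could look like, written as a post-processing step over segments rather than as a change to syntok itself. The function `merge_ordinals` and its `lang` parameter are hypothetical and do not exist in syntok; the sketch only assumes that, for German, a segment ending in number-dot is an ordinal.

```python
import re

# A segment that ends in a number followed by a dot, e.g. "... vom 1."
ENDS_WITH_NUMBER_DOT = re.compile(r"(?:^|\s)[0-9]+\.$")


def merge_ordinals(segments, lang=None):
    """Hypothetical helper: undo number-dot splits only when lang == "de"."""
    if lang != "de":
        # No language-specific rule: return the segments unchanged.
        return list(segments)
    merged = []
    for segment in segments:
        if merged and ENDS_WITH_NUMBER_DOT.search(merged[-1]):
            merged[-1] = merged[-1] + " " + segment
        else:
            merged.append(segment)
    return merged


german = ["Jonas ist Fan vom 1.", "FC Köln, und du?", "Dies ist nun der 17.", "Versuch."]
english = ["... channel 7.", "KBS also ..."]
print(merge_ordinals(german, lang="de"))
print(merge_ordinals(english, lang="en"))
```

Because the rule is switched on explicitly, the English cases that should be split stay split, and no language detection is needed. The German rule would still mis-join the rare sentence that genuinely ends in a number.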