Use Python NLTK tokenizer for German

guerda commented 2 years ago

% Use NTLK Punkt implementation for German, for improvements with abbreviations

all special abbreviations as a baseline

Discussed in #157 and https://discourse.mozilla.org/t/tokenizer-trennt-haufig-satze-auf/87288/11

MichaelKohler commented 2 years ago

Thanks for creating this PR! I had a very quick look at the sample extraction.

A few examples that look like it didn't fully get split nicely, but I haven't looked into that more closely.

August den zweiteiligen TV-Film "Paparazzo".
Periode, d-Block.
B. der Ostkaribische Dollar ("XCD")), als auch bei den IWF-Sonderziehungsrechten ("XDR").
Internationalen Filmfestspiele Berlin.
Eine Ausnahme stellt aber z.
B. in seinem Debüt "The Connection".
B. den Schulenstreit in der Psychotherapie).
Alexander dehnte die Grenzen des Reiches, das sein Vater Philipp II.
Jahrhundert stammt.
Jahrhundert v.
Jahrhundert v. Chr.
Den größten Erfolg hatte er mit "The Prisoner of Zenda" (dt.
In der EU und der Schweiz ist der Ar bzw.
Eine solche Untersuchung hat Körner am Beispiel von "Duden.
Ein Schwerpunkt der Fundanalyse ist die Datierung der Befunde (z.
B. Profisportler) wird noch eine operative Behandlung empfohlen.
B. Punkte, Striche) zu den Konsonanten gesetzt werden konnten (z.
April ein.
Daher muss sie reine Aktualität, Tätigkeit sein (Met.

On the other hand, it doesn't seem to have had issues with the following two sentences:

Sein Gegenkandidat war Senator Stephen A. Douglas, der Führer der Demokraten auf Unionsebene.
Andere Mittelmeerländer kennen im November den "St.-Martins-Sommer".

Generally it's very hard to say right now if it's better. Let me try to get a sample export with rust-punkt from the same data. Then we can better compare, I hope.

MichaelKohler commented 2 years ago

I've created a review Google Sheet to track 1000 sample sentences from both extracts. Not sure yet what the best process for comparison is, but at least we're not losing the sentences like this.

https://docs.google.com/spreadsheets/d/1n8Hk4v7Tw2JeBXVuPfMwdxjy0QRPsLJt_8t7EmfCKho/edit?usp=sharing

guerda commented 2 years ago

I reviewed the comments and feedback, thanks again. I now load and use the German model correctly and the results look promising to me. One problem I discovered is that German abbreviations are still not handled very well. Prominent example is "z. B." which is used quite frequently in Wikipedia. I'm currently trying to find out if I can use the NTLK custom abbreviations list in the pretrained German model, but I'm not optimistic. Still I see improvements over the rust punkt.

Here's the extract of 1000 sentences: https://docs.google.com/spreadsheets/d/17J4KZnYKAulqWEZ8S-N7i8P9LiYyBXcmQDqIr1NQDqc/edit?usp=sharing

I only found seldom errors besides the "z. B." case. "Generationen" was one where I suspect a counting beforehand, which caused a faulty sentence splitting.

Looking through all 1000 examples, I found 8% errors. Examples are "z. B.", "Generation", "ca.", "dt."

MichaelKohler commented 2 years ago

Yes, that indeed looks way better, and I'm pretty sure that's also better than rust-punkt. Thanks for putting effort into this!

After having skimmed through the 1000 examples, I would suggest the following approach here:

Let's remove the abbreviations from the nltk params (see following step further down)
I quickly verify the samples once that is done to make sure nothing broke majorly
We merge this PR, and rebase https://github.com/common-voice/cv-sentence-extractor/pull/157/files
Then let's add the abbreviations you used as `replacements´ in the rules file in this PR https://github.com/common-voice/cv-sentence-extractor/pull/157/files. This way we can gather even more valid sentences.
Review the DE rules PR
Re-run the extraction based on the successful pre-validation of the new rules file.

Does that sound reasonable? cc @stefangrotz

guerda commented 2 years ago

Thanks for reviewing! Do you want to incorporate the additional_abbreviations into the rules file? I'm not experienced in Rust so I don't know how I can access those settings from the Python snippet which I use in the segmenter.rs. I would need some help there.

Do you see step 1 as prerequisite for step 2? If so, I can remove it accordingly or you just use commit acd4e64, which is basically that.

guerda commented 2 years ago

While skimming #162, I got an idea how to fix the pesky "z. B." problem. We could use the replacement "z. B." --> "zum Beispiel" or "z.B." as this replacement is performed before sentence tokenization. "z.B." is recognized by punkt correctly and would not cause a sentence to be split incorrectly. What do you think?

MichaelKohler commented 2 years ago

"z.B." is recognized by punkt correctly and would not cause a sentence to be split incorrectly.

Yes, but then the full sentence would need to be discarded as it contains an abbreviation. If we replace it with "zum Beispiel" before everything else we end up with a possibly valid sentence :)

Do you want to incorporate the additional_abbreviations into the rules file? I'm not experienced in Rust so I don't know how I can access those settings from the Python snippet which I use in the segmenter.rs. I would need some help there.

I think I'll just ask Stefan if he wants to include it in his PR. Would IMHO make sense to do it all in one for easier validation.

Do you see step 1 as prerequisite for step 2? If so, I can remove it accordingly or you just use commit acd4e64, which is basically that.

I think removing would be good. Then we can merge this PR and don't lose this whole discussion.

guerda commented 2 years ago

Yes, but then the full sentence would need to be discarded as it contains an abbreviation. If we replace it with "zum Beispiel" before everything else we end up with a possibly valid sentence :)

Makes sense.

I think removing would be good. Then we can merge this PR and don't lose this whole discussion. Will do.

Here's the list I created based on the sample extraction:

"z. b", "z.b", "ca", "dt"

Be aware that the dot after the abbreviation must be left out in order to work with the PunktSentenceTokenizer.

I removed the custom abbreviations from my branch. Code is still available at commit c3fe828

MichaelKohler commented 2 years ago

Thanks, I took the liberty to re-add relevant rule parts again, which are still needed even after massively improving the segmentation. I will have a very quick look at the extract generated and merge this.

Thank you so much for these improvements!

common-voice / cv-sentence-extractor

Use Python NLTK tokenizer for German #165