common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences
52 stars 52 forks source link

Use Python NLTK tokenizer for German #165

Closed guerda closed 2 years ago

guerda commented 2 years ago

% Use NTLK Punkt implementation for German, for improvements with abbreviations

Discussed in #157 and https://discourse.mozilla.org/t/tokenizer-trennt-haufig-satze-auf/87288/11

MichaelKohler commented 2 years ago

Thanks for creating this PR! I had a very quick look at the sample extraction.

A few examples that look like it didn't fully get split nicely, but I haven't looked into that more closely.

On the other hand, it doesn't seem to have had issues with the following two sentences:

Generally it's very hard to say right now if it's better. Let me try to get a sample export with rust-punkt from the same data. Then we can better compare, I hope.

MichaelKohler commented 2 years ago

I've created a review Google Sheet to track 1000 sample sentences from both extracts. Not sure yet what the best process for comparison is, but at least we're not losing the sentences like this.

https://docs.google.com/spreadsheets/d/1n8Hk4v7Tw2JeBXVuPfMwdxjy0QRPsLJt_8t7EmfCKho/edit?usp=sharing

guerda commented 2 years ago

I reviewed the comments and feedback, thanks again. I now load and use the German model correctly and the results look promising to me. One problem I discovered is that German abbreviations are still not handled very well. Prominent example is "z. B." which is used quite frequently in Wikipedia. I'm currently trying to find out if I can use the NTLK custom abbreviations list in the pretrained German model, but I'm not optimistic. Still I see improvements over the rust punkt.

Here's the extract of 1000 sentences: https://docs.google.com/spreadsheets/d/17J4KZnYKAulqWEZ8S-N7i8P9LiYyBXcmQDqIr1NQDqc/edit?usp=sharing

I only found seldom errors besides the "z. B." case. "Generationen" was one where I suspect a counting beforehand, which caused a faulty sentence splitting.

Looking through all 1000 examples, I found 8% errors. Examples are "z. B.", "Generation", "ca.", "dt."

MichaelKohler commented 2 years ago

Yes, that indeed looks way better, and I'm pretty sure that's also better than rust-punkt. Thanks for putting effort into this!

After having skimmed through the 1000 examples, I would suggest the following approach here:

Does that sound reasonable? cc @stefangrotz

guerda commented 2 years ago

Thanks for reviewing! Do you want to incorporate the additional_abbreviations into the rules file? I'm not experienced in Rust so I don't know how I can access those settings from the Python snippet which I use in the segmenter.rs. I would need some help there.

Do you see step 1 as prerequisite for step 2? If so, I can remove it accordingly or you just use commit acd4e64, which is basically that.

guerda commented 2 years ago

While skimming #162, I got an idea how to fix the pesky "z. B." problem. We could use the replacement "z. B." --> "zum Beispiel" or "z.B." as this replacement is performed before sentence tokenization. "z.B." is recognized by punkt correctly and would not cause a sentence to be split incorrectly. What do you think?

MichaelKohler commented 2 years ago

"z.B." is recognized by punkt correctly and would not cause a sentence to be split incorrectly.

Yes, but then the full sentence would need to be discarded as it contains an abbreviation. If we replace it with "zum Beispiel" before everything else we end up with a possibly valid sentence :)

Do you want to incorporate the additional_abbreviations into the rules file? I'm not experienced in Rust so I don't know how I can access those settings from the Python snippet which I use in the segmenter.rs. I would need some help there.

I think I'll just ask Stefan if he wants to include it in his PR. Would IMHO make sense to do it all in one for easier validation.

Do you see step 1 as prerequisite for step 2? If so, I can remove it accordingly or you just use commit acd4e64, which is basically that.

I think removing would be good. Then we can merge this PR and don't lose this whole discussion.

guerda commented 2 years ago

Yes, but then the full sentence would need to be discarded as it contains an abbreviation. If we replace it with "zum Beispiel" before everything else we end up with a possibly valid sentence :)

Makes sense.

I think removing would be good. Then we can merge this PR and don't lose this whole discussion. Will do.

Here's the list I created based on the sample extraction:

"z. b", "z.b", "ca", "dt"

Be aware that the dot after the abbreviation must be left out in order to work with the PunktSentenceTokenizer.

I removed the custom abbreviations from my branch. Code is still available at commit c3fe828

MichaelKohler commented 2 years ago

Thanks, I took the liberty to re-add relevant rule parts again, which are still needed even after massively improving the segmentation. I will have a very quick look at the extract generated and merge this.

Thank you so much for these improvements!