common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences
52 stars 52 forks source link

Improve DE rules file #157

Closed stefangrotz closed 2 years ago

stefangrotz commented 3 years ago

I thought there were more things to improve, but this is the only thing I found. After that imo we are ready for a second extraction.

stefangrotz commented 3 years ago

What do you mean by "once it is ready"? Will the extraction happen in GitHub or do I have to do it on my machine?

MichaelKohler commented 3 years ago

Every commit on a PR automatically triggers a sample extraction on GitHub (second one in the row below)

image

If you click on "Details", and then "Artifacts", you can download it:

image

In this case here it extracted 15k sentences, which is pretty nice to do validation on without having to run anything locally.

stefangrotz commented 3 years ago

Okay, nice. Here is a spreadsheet with a few hundred random sentences from the export: https://docs.google.com/spreadsheets/d/1WRjnz-rWeJwZenl5V71At5Q8CQdBk4A5cIKpuSXf8Ds/edit?usp=sharing

Looks good on first sight, the only avoidable error I see more than once are non-closing brackets, so single "(" or ")".

The complete export contains around 15 000 sentences.

MichaelKohler commented 3 years ago

Good point, let's add this to matching_symbols:

matching_symbols = [
  ["„", "“"],
  ["(", ")"]
]
stefangrotz commented 3 years ago

Done, I will update the sheet tomorrow.

stefangrotz commented 3 years ago

Thanks. I updated the spreadsheet with a sample of the new import. This time the complete import contains over 18000 sentences.

MichaelKohler commented 3 years ago

I'd say this generally looks good to me. Couldn't mark anything as I don't have edit permissions, but that's fine.

Two thoughts:

stefangrotz commented 3 years ago

I corrected the allowed symbols, added even symbols and added abbreviation ggf. But now a syntax error in broken_whitespace is highlighted. any idea whats wrong there?

stefangrotz commented 3 years ago

Thanks. I updated the evaluation sheet with a new sample: https://docs.google.com/spreadsheets/d/1WRjnz-rWeJwZenl5V71At5Q8CQdBk4A5cIKpuSXf8Ds/edit#gid=0

MichaelKohler commented 3 years ago

Another idea here would be to look into the most common abbreviations and expand them through replacements. Generally the newer articles are smaller and that might actually make a difference.

I'm seeing quite a lot of hard words, might be worth a try to re-run the blocklist generation. What do you think?

guerda commented 3 years ago

I would also add "Dr." to the abbreviation list, as discussed in discourse. See my PR here: https://github.com/stefangrotz/cv-sentence-extractor/pull/1

stefangrotz commented 3 years ago

Hey @guerda thanks, I added Dr. and Prof. , I am sure there are a lot of other possible abbreviations that we could add, I will think about this and collect more common abbreviations.

@MichaelKohler Re-running the blocklist is definitely a good idea after so much time. Maybe it would be good to put the numbers a little higher to avoid uncommon words.

MichaelKohler commented 3 years ago

/action blocklist de 80

Let's see if 80 is fine. We've used 80 for quite a few other languages. We can experiment a bit and go further down if needed.

github-actions[bot] commented 3 years ago

Job started: https://github.com/common-voice/cv-sentence-extractor/actions/runs/1360157048

guerda commented 3 years ago

Hey @guerda thanks, I added Dr. and Prof. , I am sure there are a lot of other possible abbreviations that we could add, I will think about this and collect more common abbreviations.

Great, thank you! I'm already maintaining a small list of abbreviations which I can imagine and one with words which appear after a number and occur in the sentences I speak or listen to.

MichaelKohler commented 3 years ago

maintaining a small list of abbreviations

Let's add it as replacements, we don't need to be super strict on this, but the more we can place in replacements the less odd splitting we'll have. Of course that won't work with things like "3. Jahrhundert", but for things like "Dr." it would probably make sense. I would probably go for the top 10 or so abbreviations.

one with words which appear after a number

Doesn't necessarily have to be after a number, that can also happen with other abbreviations (see above for that solution). However adding a few that are in the same category as "3. Jahrhundert" definitely wouldn't hurt.

Did you already have a chance to look into NLTK? We absolutely don't have any stress to re-run the extraction, just wondering as that might be quite beneficial before we trigger the re-run :)

MichaelKohler commented 3 years ago

https://github.com/ftyers/commonvoice-utils/blob/main/cvutils/data/de/abbr.tsv might serve as inspiration as well

github-actions[bot] commented 3 years ago

Job finished: https://github.com/common-voice/cv-sentence-extractor/actions/runs/1360157048 Don't forget to download the artifacts.

MichaelKohler commented 3 years ago

@stefangrotz wanna replace the blocklist with the new one from https://github.com/common-voice/cv-sentence-extractor/suites/4099077919/artifacts/104602005? That's based off 80 occurances, we can tweak as we see fit though.

stefangrotz commented 3 years ago

Sounds good to me. Is this a task for me or just a question?

I will have more time in a few days and add the abbreviations from the list then.

MichaelKohler commented 3 years ago

That's just a question if you wanna do it. Otherwise I can also create a PR against your repo and you just need to merge it. Up to you :)

MichaelKohler commented 3 years ago

FYI, if you want to add replacements please rebase this branch on top of latest main, I fixed a bug with that.

MichaelKohler commented 3 years ago

Replacements I would suggest, completely based on what @guerda has figured out in https://github.com/common-voice/cv-sentence-extractor/pull/165/files:

replacements = [
  ["z.B.", "zum Beispiel"],
  ["z. B.", "zum Beispiel"],
  ["ca.", "circa"],
]

I would ignore "dt", as it's very likely the part before it wouldn't be German anyway. @stefangrotz what do you think? As said before, happy to create a PR against your branch if you want me to.

MichaelKohler commented 3 years ago

Ah, and once https://github.com/common-voice/cv-sentence-extractor/pull/165 is merged, we should rebase this branch here for the better sentence separation.

guerda commented 3 years ago

I would ignore "dt", as it's very likely the part before it wouldn't be German anyway.

There were examples in the extraction with dt. as an abbreviation for Deutsch, e.g. Deutsche Einheit. But I agree, this might be a seldom occurrence

MichaelKohler commented 3 years ago

There were examples in the extraction with dt. as an abbreviation for Deutsch, e.g. Deutsche Einheit. But I agree, this might be a seldom occurrence

And for those cases we couldn't guess the correct form when replacing, i.e. deutsch, deutsches, deutsche, etc.

MichaelKohler commented 3 years ago

I noticed that I have write access here as well, and had some time, so I rebased the branch to include the latest changes by @guerda in regards to sentence splitting and I pushed some replacements based on what @guerda found, as well as the list provided by https://github.com/ftyers/commonvoice-utils/blob/main/cvutils/data/de/abbr.tsv.

I hope you don't mind the intrusion, and sorry for the many commits fixing my previous mistake.

Here's the latest extract: https://docs.google.com/spreadsheets/d/1Q_IO0GLg8vdjyYEMlCctL5sB8QYMEPBhbYgUF6aheyM/edit#gid=0

Please have a look at that as well as the added replacements. Looks like we indeed will also want a blocklist.

MichaelKohler commented 3 years ago

/action blocklist de 40

github-actions[bot] commented 3 years ago

Job started: https://github.com/common-voice/cv-sentence-extractor/actions/runs/1392104652

github-actions[bot] commented 3 years ago

Job finished: https://github.com/common-voice/cv-sentence-extractor/actions/runs/1392104652 Don't forget to download the artifacts.

MichaelKohler commented 2 years ago

I reviewed a few more sentences in https://docs.google.com/spreadsheets/d/1Q_IO0GLg8vdjyYEMlCctL5sB8QYMEPBhbYgUF6aheyM/edit#gid=0 . What I noticed is that we probably could add "Abs." to the replacements as well.

MichaelKohler commented 2 years ago

I've merged this as the extraction pipeline by now seems to have some issues. I will create a follow-up PR for needed changes and an additional review.