Closed stefangrotz closed 2 years ago
What do you mean by "once it is ready"? Will the extraction happen in GitHub or do I have to do it on my machine?
Every commit on a PR automatically triggers a sample extraction on GitHub (second one in the row below)
If you click on "Details", and then "Artifacts", you can download it:
In this case here it extracted 15k sentences, which is pretty nice to do validation on without having to run anything locally.
Okay, nice. Here is a spreadsheet with a few hundred random sentences from the export: https://docs.google.com/spreadsheets/d/1WRjnz-rWeJwZenl5V71At5Q8CQdBk4A5cIKpuSXf8Ds/edit?usp=sharing
Looks good at first sight; the only avoidable error I see more than once is unmatched brackets, i.e. a single "(" or ")".
The complete export contains around 15 000 sentences.
Good point, let's add this to matching_symbols:

matching_symbols = [
  ["„", "“"],
  ["(", ")"]
]
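The extractor itself is written in Rust, but the idea behind matching_symbols is simple to sketch. This is a hypothetical Python illustration (not the actual implementation): a sentence is rejected when any configured symbol pair has a different number of opening and closing occurrences.

```python
# Illustrative sketch of the matching_symbols rule (not the real extractor code):
# reject a sentence when any configured pair is unbalanced.
MATCHING_SYMBOLS = [("„", "“"), ("(", ")")]

def has_balanced_symbols(sentence: str) -> bool:
    for opening, closing in MATCHING_SYMBOLS:
        if sentence.count(opening) != sentence.count(closing):
            return False
    return True

print(has_balanced_symbols("Das ist (fast) fertig."))  # True
print(has_balanced_symbols("Das ist (fast fertig."))   # False
```

With this rule in place, the lone "(" and ")" cases from the spreadsheet would be filtered out.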
Done, I will update the sheet tomorrow.
Thanks. I updated the spreadsheet with a sample of the new import. This time the complete import contains over 18,000 sentences.
I'd say this generally looks good to me. Couldn't mark anything as I don't have edit permissions, but that's fine.
Two thoughts:

1. Add " to even_symbols, as it appears in a few sentences without a counterpart. However, I will check why it even shows up, as in theory it's not an allowed symbol (though it probably should be).
2. Add ggf\\.$ to the other pattern, as "ggf." should not be at the end of a sentence.

I corrected the allowed symbols, added the even symbols, and added the abbreviation "ggf.". But now a syntax error in broken_whitespace is highlighted. Any idea what's wrong there?
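For symbols like " that have no distinct opening and closing form, a pair check doesn't work; even_symbols instead requires an even number of occurrences. A hypothetical sketch of that rule (not the actual extractor code):

```python
# Illustrative sketch of the even_symbols rule: symbols without a distinct
# opening/closing variant must occur an even number of times.
EVEN_SYMBOLS = ['"']

def has_even_symbols(sentence: str) -> bool:
    return all(sentence.count(symbol) % 2 == 0 for symbol in EVEN_SYMBOLS)

print(has_even_symbols('Er sagte "Hallo" zu mir.'))  # True
print(has_even_symbols('Er sagte "Hallo zu mir.'))   # False
```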
Thanks. I updated the evaluation sheet with a new sample: https://docs.google.com/spreadsheets/d/1WRjnz-rWeJwZenl5V71At5Q8CQdBk4A5cIKpuSXf8Ds/edit#gid=0
Another idea here would be to look into the most common abbreviations and expand them through replacements. Generally the newer articles are smaller, and that might actually make a difference.
I'm seeing quite a lot of hard words, might be worth a try to re-run the blocklist generation. What do you think?
I would also add "Dr." to the abbreviation list, as discussed in discourse. See my PR here: https://github.com/stefangrotz/cv-sentence-extractor/pull/1
Hey @guerda, thanks! I added "Dr." and "Prof.". I'm sure there are a lot of other possible abbreviations we could add; I will think about this and collect more common abbreviations.
@MichaelKohler Re-running the blocklist is definitely a good idea after so much time. Maybe it would be good to set the threshold a little higher to avoid uncommon words.
/action blocklist de 80
Let's see if 80 is fine. We've used 80 for quite a few other languages. We can experiment a bit and go further down if needed.
Great, thank you! I'm already maintaining a small list of abbreviations I can think of, and one with words that appear after a number, drawn from the sentences I speak or listen to.
maintaining a small list of abbreviations
Let's add it as replacements; we don't need to be super strict on this, but the more we can place in replacements, the less odd splitting we'll have. Of course that won't work with things like "3. Jahrhundert", but for things like "Dr." it would probably make sense. I would probably go for the top 10 or so abbreviations.
one with words which appear after a number
It doesn't necessarily have to be after a number; that can also happen with other abbreviations (see above for that solution). However, adding a few that are in the same category as "3. Jahrhundert" definitely wouldn't hurt.
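The "3. Jahrhundert" category is tricky because the period marks a German ordinal, not a sentence boundary, and the correct expansion depends on grammatical inflection. A hypothetical sketch of how one might at least detect such patterns (the word list here is an invented example, not from the extractor):

```python
import re

# Illustrative sketch: detect ordinal patterns like "3. Jahrhundert" where the
# period belongs to an ordinal number rather than ending the sentence.
# The noun list is a made-up example for demonstration purposes.
ORDINAL_PATTERN = re.compile(r"\b\d+\.\s+(?:Jahrhunderts|Jahrhundert|Klasse|Platz)\b")

def contains_ordinal(sentence: str) -> bool:
    return bool(ORDINAL_PATTERN.search(sentence))

print(contains_ordinal("Im 3. Jahrhundert wurde die Stadt gegründet."))  # True
```

Sentences matching such a pattern could then be excluded rather than expanded, side-stepping the inflection problem entirely.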
Did you already have a chance to look into NLTK? There's absolutely no rush to re-run the extraction, just wondering, as that might be quite beneficial before we trigger the re-run :)
https://github.com/ftyers/commonvoice-utils/blob/main/cvutils/data/de/abbr.tsv might serve as inspiration as well
Job finished: https://github.com/common-voice/cv-sentence-extractor/actions/runs/1360157048 Don't forget to download the artifacts.
@stefangrotz wanna replace the blocklist with the new one from https://github.com/common-voice/cv-sentence-extractor/suites/4099077919/artifacts/104602005? That's based off 80 occurrences; we can tweak as we see fit though.
Sounds good to me. Is this a task for me or just a question?
I will have more time in a few days and add the abbreviations from the list then.
That's just a question if you wanna do it. Otherwise I can also create a PR against your repo and you just need to merge it. Up to you :)
FYI, if you want to add replacements, please rebase this branch on top of the latest main; I fixed a bug with that.
Replacements I would suggest, completely based on what @guerda has figured out in https://github.com/common-voice/cv-sentence-extractor/pull/165/files:
replacements = [
  ["z.B.", "zum Beispiel"],
  ["z. B.", "zum Beispiel"],
  ["ca.", "circa"],
]
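Conceptually, applying these replacements is a straightforward substring substitution before the other filters run. A minimal Python sketch of the idea (the actual extractor applies its rules in Rust):

```python
# Illustrative sketch of the replacements rule: expand common abbreviations
# before further filtering, so sentences aren't split or rejected on them.
REPLACEMENTS = [
    ("z. B.", "zum Beispiel"),
    ("z.B.", "zum Beispiel"),
    ("ca.", "circa"),
]

def apply_replacements(sentence: str) -> str:
    for abbreviation, expansion in REPLACEMENTS:
        sentence = sentence.replace(abbreviation, expansion)
    return sentence

print(apply_replacements("Das dauert ca. drei Stunden."))
# Das dauert circa drei Stunden.
```

One design note: this only works for abbreviations with a single fixed expansion, which is exactly why inflected cases like "dt." (deutsch/deutsche/deutsches) are discussed below as poor candidates.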
I would ignore "dt", as it's very likely the part before it wouldn't be German anyway. @stefangrotz what do you think? As said before, happy to create a PR against your branch if you want me to.
Ah, and once https://github.com/common-voice/cv-sentence-extractor/pull/165 is merged, we should rebase this branch here for the better sentence separation.
I would ignore "dt", as it's very likely the part before it wouldn't be German anyway.
There were examples in the extraction with "dt." as an abbreviation for "deutsch", e.g. in "Deutsche Einheit". But I agree, this might be a rare occurrence.
And for those cases we couldn't guess the correct form when replacing, i.e. deutsch, deutsches, deutsche, etc.
I noticed that I have write access here as well, and had some time, so I rebased the branch to include the latest changes by @guerda regarding sentence splitting, and I pushed some replacements based on what @guerda found, as well as on the list provided by https://github.com/ftyers/commonvoice-utils/blob/main/cvutils/data/de/abbr.tsv.
I hope you don't mind the intrusion, and sorry for the many commits fixing my previous mistake.
Here's the latest extract: https://docs.google.com/spreadsheets/d/1Q_IO0GLg8vdjyYEMlCctL5sB8QYMEPBhbYgUF6aheyM/edit#gid=0
Please have a look at that as well as the added replacements. Looks like we indeed will also want a blocklist.
/action blocklist de 40
Job finished: https://github.com/common-voice/cv-sentence-extractor/actions/runs/1392104652 Don't forget to download the artifacts.
I reviewed a few more sentences in https://docs.google.com/spreadsheets/d/1Q_IO0GLg8vdjyYEMlCctL5sB8QYMEPBhbYgUF6aheyM/edit#gid=0 . What I noticed is that we probably could add "Abs." to the replacements as well.
I've merged this, as the extraction pipeline seems to have some issues by now. I will create a follow-up PR for the needed changes and an additional review.
I thought there were more things to improve, but this is the only thing I found. After that, IMO we are ready for a second extraction.