Closed SeaLiteral closed 1 year ago
I just remembered which Danish word I've often thought about that changes its pronunciation depending on capitalization: Mors. Uncapitalized, mors (/moɐ̯s/) means "mum's", but capitalized, Mors (/mɒːs/) is the name of an island between Jutland and Thy. It's a place in Denmark, so more likely to get mentioned than the foreign names I mentioned in the description above. The distinction in pronunciation can depend on the accent, but in speech that's intended for a large audience or for transcription, distinguishing them seems a lot more common than pronouncing them the same way.
And I don't really know how speech recognition tools in general handle that type of word. I'd guess it's common to add extra words and say things like "øen Mors" ("the island of Mors") when there's a need for an accurate transcription, but that's based on having read subtitles in a different language (Spanish) where "sí" (meaning "yes", but it sounds very similar to a word meaning "if") got transcribed as "efectivamente" ("that's right"). The average person doesn't add extra words to give hints about difficult words.
Just a quick drive-by here, haven't had much time to look closer into this, but from what I've read I think we could already do parts of that. Every language can have a validation file which will run on all uploaded sentences. Sentences that don't match will be rejected. Currently the categories are quite constrained, so we might need a new category for this (as I said, haven't looked closely at this one here). This is how it would need to be done: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/VALIDATION.md
I tend to think that sentences with old spelling should be rejected. Most of the sentences from "Excentriske noveller" are bad, both in terms of "aa" and the capitalized nouns as well as some dated language. I have rejected many, while accepting a few here and there. Finding public domain Danish texts with modern language is a bit of a task. The laws of Denmark should be ok and I have added a few from https://www.retsinformation.dk/eli/lta/2019/936
I have now completed the sentences that were shown to me for review. I rejected most of the "Excentriske noveller" sentences. The other sentences were mostly ok.
I have now discovered https://github.com/common-voice/common-voice/blob/main/server/data/da/sentence-collector.txt and see some sentences are questionable:
Validation files can now contain an arbitrary number of rules, and each of these rules can define its own error message. Therefore I'm going to close this, as it can be solved by adding a validation file for Danish. More information can be found at https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/VALIDATION.md.
Someone has added lots of sentences from www.andersstories.com/da/andersen_fortaellinger/list. I think they should all be erased. For instance, "Nu er han vel under jorden og du med ide". While it contains no misspelling that we can detect with a modern word-based spell checker, there is something odd about the sentence, and e.g. "ide" should probably be "Ide" to indicate a proper name for a person and not "idé". "Knap var hun paa ord, tvær af mine, men villig til sin dont". This is very antiquated language. In this case one could perhaps add a rule on "aa" (for the word "paa"), but in general there are lots of old-sounding sentences that degrade the Common Voice sentences for Danish.
In https://github.com/common-voice/common-voice/blob/main/server/data/da/sentence-collector.txt strange things like
Should one pull request on that file?
Hi, thanks for reporting this. That file gets automatically exported from the Sentence Collector. Any PR to that file would be overwritten again on the next export, so these sentences would need to be deleted in the Sentence Collector database.
If you identify any wrong sentence, you can look at its details by using this link and adjusting the sentence at the end: https://commonvoice.mozilla.org/sentence-collector/sentences/da?sentence=Andersen%20-%20stemmer%20solskinshistoriernu%20skal%20jeg%20fort%C3%A6lle
Or alternatively you can find all the sentences here: https://commonvoice.mozilla.org/sentence-collector/sentences/da
Every sentence has a "source" and a "batch" field. If you notice a pattern there, such as everything coming from the same source or even the same "batch", we can delete based on that. However for both cases we'd need to be a bit careful to not delete a lot of valid sentences with it as well.
We can also delete sentence by sentence if you can give me a text file with a sentence per line that should be deleted.
In the end, I'd like to have a Danish validation file to avoid these mistakes in the future. While deleting sentences is doable, it requires quite some effort to identify the sentences to be deleted. If we can come up with a validation file, we could tackle the issue at its core and not even allow sentences to be uploaded that would later need to be deleted.
Someone has added lots of sentences from www.andersstories.com/da/andersen_fortaellinger/list. I think they should all be erased. For instance, "Nu er han vel under jorden og du med ide". While it contains no misspelling that we can detect with a modern word-based spell checker, there is something odd about the sentence, and e.g. "ide" should probably be "Ide" to indicate a proper name for a person and not "idé". "Knap var hun paa ord, tvær af mine, men villig til sin dont". This is very antiquated language. In this case one could perhaps add a rule on "aa" (for the word "paa"), but in general there are lots of old-sounding sentences that degrade the Common Voice sentences for Danish.
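To make the idea of "rules with their own error messages" concrete, here is a minimal sketch in Python. It is not the Sentence Collector's actual validation format (that is described in VALIDATION.md); the rule list, the messages, and the `validate` function are all hypothetical and only illustrate how a rule on "aa" might flag old spelling before upload.

```python
import re

# Hypothetical rules: (pattern, error message). Not the real validation file format.
RULES = [
    (re.compile(r"aa", re.IGNORECASE), "Contains 'aa', which may indicate old spelling"),
    (re.compile(r"[0-9]"), "Contains digits"),
]

def validate(sentence, rules=RULES):
    """Return the error messages of all rules the sentence violates."""
    return [message for pattern, message in rules if pattern.search(sentence)]

print(validate("Knap var hun paa ord, tvær af mine, men villig til sin dont"))
print(validate("Det er en hest."))
```

A rule this blunt would also reject the few modern words and names that legitimately contain "aa", so it trades some valid sentences for a cleaner corpus.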
I agree.
All sentences from "https://www.andersenstories.com/da/andersen_fortaellinger/*" should be removed.
Very few of the 20,000+ sentences are correct sentences in modern Danish. This is not surprising since the texts were mostly written in the 1830s.
I don't see how a validation file can easily be produced that would identify even a minority of these sentences. The many incomplete sentences aside, the language used by H.C. Andersen is not valid Danish today for both obvious and subtle reasons, e.g.:
I wrote a book (called Creative Commons for alle) some years ago. I was wondering if the text could be of any use for our collection of sentences. I think the text should be cleaned, so that numbers and English names (like Creative Commons) are changed to something more generic like "hest". But if I do clean the text how do I then import 70 pages of text. With the sentence collector or by uploading a file to...? Thank you. I would like to see this project succeed!
I was wondering if the text could be of any use for our collection of sentences.
I think that would be fine! I wonder if English names are a problem.
But if I do clean the text how do I then import 70 pages of text. With the sentence collector or by uploading a file to...? Thank you. I would like to see this project succeed!
Depends on the number of sentences. 70 pages of text might not qualify for the bulk upload, though I'm not sure how hard that limit is (I haven't recently checked text submitted through that, so it might be worth a try). You can find more info in the "Bulk submission" section at https://common-voice.github.io/community-playbook/sub_pages/text.html.
I agree with all points above - the HC Andersen texts are old and do not represent modern Danish. Furthermore, the import seems to have many errors.
I have cleaned up the sentence collector file before seeing that the sentence collector does not use pull requests.
Is there a way to edit the sentences in the database instead of deleting them? I can extract the lines I have identified that need to be edited or deleted, and we could delete all. I do not think it is a loss of quality for the database to delete them all instead of doing the edits below.
I have:
You can see the changes here: https://github.com/common-voice/common-voice/compare/main...Strads:common-voice:main
This still leaves a lot of "old Danish" sentences in the dataset, but at least we do not have to speak each sentence twice! (I am not sure I have ever recorded anything not from HC Andersen, which makes me think the Common Voice dataset will work fantastically for fairy tales.)
Attached is a file with sentences that are duplicates (some may not be valid due to being detected as duplicates after removing leading symbols), the "hans christian..." sentences, and the sentences not transcribed properly. The sentences with leading symbols are not included. sentences_to_remove_da_sentence-collector.txt
@jakobkappel would you also agree that the sentences in the sentences_to_remove_da_sentence-collector.txt file in the comment above should be deleted? I see @fnielsen has put a thumbs-up emoji on it. If everyone agrees I can take that list and remove those sentences.
I have sampled some sentences in sentences_to_remove_da_sentence-collector.txt. Some are ok/okish.
@fnielsen Thank you for testing the file. The file contains duplicates, i.e. the "ok" and "okish" lines from your sampling exist in the dataset with a "." after. I do not think there is much reason to include both.
@MichaelKohler What more do we need to get this change committed? Both @fnielsen and @jakobkappel have given a thumbs-up in agreement, as far as I understand.
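For anyone who wants to reproduce this kind of duplicate detection, here is a rough sketch in Python. It is an assumption about how such a list could be built, not the script that produced the attached file; `normalize` and `find_near_duplicates` are hypothetical names. Sentences count as duplicates when they match after dropping leading symbols and a trailing ".".

```python
from collections import defaultdict

def normalize(sentence):
    """Strip leading non-alphanumeric symbols and trailing full stops."""
    s = sentence.strip()
    start = 0
    while start < len(s) and not s[start].isalnum():
        start += 1
    # Case is kept on purpose, since capitalization can matter in Danish.
    return s[start:].rstrip(".").rstrip()

def find_near_duplicates(sentences):
    """Group sentences that are identical after normalization."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[normalize(sentence)].append(sentence)
    return [group for group in groups.values() if len(group) > 1]

sample = ["Det er en hest", "Det er en hest.", "- Det er en hest.", "Det er et hus."]
print(find_near_duplicates(sample))
```

Each group would still need a manual decision about which variant to keep, typically the one with correct final punctuation.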
The discussion can continue on whether or not we should remove all fairy tale text from the corpus, but this is possible to do later as well.
If I may add. I think we should remove old texts, fairytales etc. as quickly as possible - it gives a bad impression of the project for any new users.
Sorry for not getting back to you on this, I'm not getting an email when there are emojis added to comments. In any case, the sentences are now deleted from Sentence Collector.
When trying to review Danish sentences, I found that a lot of them contain spelling mistakes. This seems to happen in a lot of the sentences from Project Gutenberg. There were two issues with them:
So I've been clicking thumbs-down on some of those sentences, but figured it would be a lot of sentences to reject simply because of two spelling mistakes that appear in a lot of sentences. But if a lot of sentences contain the same misspelling and would have to get rejected, wouldn't it be better if someone can fix the mistakes so we don't lose a lot of otherwise usable sentences?
The å issue could be fixed in most sentences by find-replacing "aa" with "å", but there could be a few places where that would replace a correct spelling with a wrong one (I guess you could use a spellchecking dictionary to identify words that are already correct and not change those).
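A minimal sketch of that dictionary-guarded replacement, in Python. The word set below is a toy stand-in for a real spellchecking dictionary, and the function names are made up for illustration. A word is only changed when its original form is not in the dictionary and the form with "å" is.

```python
import re

# Toy stand-in for a real Danish spellchecking dictionary (lowercase entries).
# The entries are illustrative assumptions, not a real word list.
MODERN_WORDS = {"på", "år", "også", "aabenraa"}

def _to_aring(match):
    # Keep the capitalization of the digraph: "Aa"/"AA" -> "Å", "aa" -> "å".
    return "Å" if match.group(0)[0].isupper() else "å"

def modernize_word(word, dictionary):
    lower = word.lower()
    # Leave the word alone if it has no "aa" or is already a known spelling.
    if "aa" not in lower or lower in dictionary:
        return word
    candidate = re.sub("aa", _to_aring, word, flags=re.IGNORECASE)
    # Only accept the replacement if the result is a known word.
    return candidate if candidate.lower() in dictionary else word

def modernize_sentence(sentence, dictionary=MODERN_WORDS):
    return re.sub(r"\w+", lambda m: modernize_word(m.group(0), dictionary), sentence)

print(modernize_sentence("Knap var hun paa ord"))
```

Words that the dictionary does not recognize in either form are left untouched, so they would still need a human reviewer.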
And then there's the capitalization thing. Maybe you could use a spellchecking dictionary to find out which words should be capitalized and which ones shouldn't. And if the previous word is "en" or "et", it's also likely to be a common noun that shouldn't be capitalized. I'm not sure how much work it would take to fix those automatically.
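Here is a rough Python sketch of that heuristic, assuming one sentence per input string. The noun set is a toy placeholder for a real dictionary, and `fix_noun_capitalization` is a hypothetical name. A capitalized word that is not the first word gets lowercased if it is a known common noun or if the previous word is "en" or "et".

```python
import re

# Toy placeholder for a dictionary of common nouns (lowercase); an assumption.
COMMON_NOUNS = {"hest", "konge", "hus"}
ARTICLES = {"en", "et"}

def fix_noun_capitalization(sentence, common_nouns=COMMON_NOUNS):
    out = []
    prev_word = None  # lowercased previous word; None for the first word
    for token in re.findall(r"\w+|\W+", sentence):
        if not re.fullmatch(r"\w+", token):
            out.append(token)  # whitespace and punctuation pass through
            continue
        # Only consider words like "Konge"; single letters and acronyms are skipped.
        is_capitalized = token[:1].isupper() and token[1:].islower()
        if prev_word is not None and is_capitalized:
            if token.lower() in common_nouns or prev_word in ARTICLES:
                token = token.lower()
        out.append(token)
        prev_word = token.lower()
    return "".join(out)

print(fix_noun_capitalization("Der var en Konge i et stort Hus."))
```

A dictionary lookup alone cannot settle words that are valid both capitalized and uncapitalized, so proper names that double as common words would need a separate exception list or manual review.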
Also, I've realized DeepSpeech doesn't really care about capitalization. But you don't say that in the sentence reviewing instructions, so I'm assuming you do want correct capitalization in the sentences. Does capitalization show in the pronunciation of words? Well, uncapitalized, helen could mean "healing" as a noun, or it could mean "all of", but it's pretty uncommon. And it sounds different from Helen, a foreign name that would probably get mistranscribed. And then there are Eu (the short form of a longer Spanish name; it's not very common in Danish) and EU (what's known in English as "the EU"), which have different pronunciations, although the difference depends on the accent.
For now, I think I'll avoid voting on sentences with either of those two mistakes just in case.
I initially mentioned this on https://discourse.mozilla.org/t/danish-sentences-with-old-spelling-and-even-more-with-old-capitalization/73935 and later added a comment on an issue (#317) that seemed to be about spelling in another language. But then on Discourse, I read that I should make a separate issue specifically for the thing in Danish.