common-voice / sentence-collector

Tool to collect and review sentences for Common Voice
https://commonvoice.mozilla.org/sentence-collector/
Mozilla Public License 2.0

Lots of Danish sentences have spelling and capitalization mistakes because they were written more than 80 years ago #411

Closed SeaLiteral closed 1 year ago

SeaLiteral commented 3 years ago

When trying to review Danish sentences, I found that a lot of them contain spelling mistakes. This seems to happen in a lot of the sentences from Project Gutenberg. There are two issues with them:

  1. Our alphabet got a new letter, å, like 80 years ago. A lot of words that should be spelled with that letter appear with two a's instead. Probably because they were written before the spelling reform. Examples of sentences that should have å but have aa: Instrumentet er saa lille, sagde Charlot. (correction: Instrumentet er så lille, sagde Charlot.) Hele Orkestret havde staaet deroppe. (correction: Hele orkestret havde stået deroppe.)
  2. I don't remember when the capitalization rules were changed, but those too were changed long ago. In fact, one of the sentences above has a capitalized noun that shouldn't be capitalized nowadays. More sentences with that: Han så efter Etager, hvor der blev tændt op til Fest. (Han så efter etager, hvor der blev tændt op til fest.) Han elskede de levninger, Moderen fik rundt i de "gode" Huse, hvor hun vaskede. (Han elskede de levninger, moderen fik rundt i de "gode" huse, hvor hun vaskede.)

So I've been clicking thumbs-down on some of those sentences, but figured it would be a lot of sentences to reject simply because of two spelling mistakes that appear in a lot of sentences. But if a lot of sentences contain the same misspelling and would have to get rejected, wouldn't it be better if someone can fix the mistakes so we don't lose a lot of otherwise usable sentences?

The å issue could be fixed in most sentences by find-replacing "aa" with "å", but there could be a few places where that would replace a correct spelling with a wrong one (I guess you could use a spellchecking dictionary to identify words that are already correct and not change those).
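A minimal sketch of that dictionary-guarded replacement, in Python. `MODERN_WORDS` is a tiny hypothetical stand-in for a real Danish spellchecking dictionary, and the function names are made up for illustration; nothing here is part of Sentence Collector.

```python
import re

# Hypothetical stand-in for a real Danish spellchecking dictionary (lowercase).
MODERN_WORDS = {"så", "stået", "på", "kanaan"}

WORD_RE = re.compile(r"\w+")


def modernize_word(word: str, lexicon: set[str]) -> str:
    """Replace aa with å only if the word is unknown and the result is known."""
    if "aa" not in word.lower() or word.lower() in lexicon:
        return word  # nothing to do, or already a correct modern spelling
    candidate = word.replace("aa", "å").replace("Aa", "Å").replace("AA", "Å")
    # Only accept the change when the dictionary confirms the new spelling.
    return candidate if candidate.lower() in lexicon else word


def modernize_sentence(sentence: str, lexicon: set[str] = MODERN_WORDS) -> str:
    """Apply modernize_word to every word, leaving punctuation untouched."""
    return WORD_RE.sub(lambda m: modernize_word(m.group(0), lexicon), sentence)


print(modernize_sentence("Instrumentet er saa lille, sagde Charlot."))
```

Words that the dictionary cannot confirm either way are left unchanged, so such sentences would still need a human reviewer.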

And then there's the capitalization thing. Maybe you could use a spellchecking dictionary to find out which words should be capitalized and which ones shouldn't. And if the previous word is "en" or "et", it's also likely to be a common noun that shouldn't be capitalized. I'm not sure how much work it would take to fix those automatically.
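A rough sketch of those two heuristics, again in Python. `COMMON_WORDS` is a hypothetical stand-in for a dictionary of lowercase common nouns, and the code assumes each input is a single sentence, so the first word is never touched.

```python
import re

# Hypothetical stand-in for a dictionary of common (non-proper) nouns.
COMMON_WORDS = {"etager", "fest", "moderen", "huse", "orkestret"}
ARTICLES = {"en", "et"}

WORD_RE = re.compile(r"\w+")


def fix_capitalization(sentence: str, lexicon: set[str] = COMMON_WORDS) -> str:
    """Lowercase mid-sentence words that look like capitalized common nouns."""
    previous = None  # previous word in lowercase; None at the sentence start

    def repl(match: re.Match) -> str:
        nonlocal previous
        word = match.group(0)
        fixed = word
        # Skip the first word and all-caps words such as "EU" or the pronoun "I".
        if previous is not None and word[0].isupper() and not word.isupper():
            if word.lower() in lexicon or previous in ARTICLES:
                fixed = word[0].lower() + word[1:]
        previous = word.lower()
        return fixed

    return WORD_RE.sub(repl, sentence)


print(fix_capitalization("Han så efter Etager, hvor der blev tændt op til Fest."))
```

The article rule is only a guess at what follows "en" or "et", so it can wrongly lowercase a proper name, and words like Mors/mors cannot be decided from a word list at all.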

Also, I've realized DeepSpeech doesn't really care about capitalization. But you don't say that in the sentence reviewing instructions, so I'm assuming you do want correct capitalization in the sentences. Does capitalization show in the pronunciation of words? Well, uncapitalized, helen could mean "healing" as a noun, or it could mean "all of", but it's pretty uncommon. And it sounds different from Helen, a foreign name that would probably get mistranscribed. And then there are Eu (the short form of a longer Spanish name, not very common in Danish) and EU (what's known in English as "the EU"), which have different pronunciations, although the difference depends on the accent.

For now, I think I'll avoid voting on sentences with either of those two mistakes, just in case.

I initially mentioned this on https://discourse.mozilla.org/t/danish-sentences-with-old-spelling-and-even-more-with-old-capitalization/73935 and later added a comment on an issue (#317) that seemed to be about spelling in another language. But then on Discourse, I read that I should make a separate issue specifically for the thing in Danish.

SeaLiteral commented 3 years ago

I just remembered which Danish word I've often thought about that changes its pronunciation depending on capitalization: Mors. Uncapitalized, mors (/moɐ̯s/) means "mum's", but capitalized, Mors (/mɒːs/) is the name of an island between Jutland and Thy. It's a place in Denmark, so more likely to get mentioned than the foreign names I mentioned in the description above. The distinction in pronunciation can depend on the accent, but in speech that's intended for a large audience or for transcription, distinguishing them seems a lot more common than pronouncing them the same way.

And I don't really know how speech recognition tools in general handle that type of words. I'd guess it's common to add extra words and say things like "øen Mors" ("the island of Mors") when there's a need for an accurate transcription, but that's based on having read subtitles in a different language (Spanish) where "sí" (meaning "yes", but it sounds very similar to a word meaning "if") got transcribed as "efectivamente" ("that's right"). The average person doesn't add extra words to give hints about difficult words.

MichaelKohler commented 3 years ago

Just a quick drive-by here, haven't had much time to look closer into this, but from what I've read I think we could already do parts of that. Every language can have a validation file which will run on all uploaded sentences. Sentences that don't match will be rejected. Currently the categories are quite constrained, so we might need a new category for this (as I said, haven't looked closely at this one here). This is how it would need to be done: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/VALIDATION.md

fnielsen commented 3 years ago

I have a tendency to say that sentences with old spelling should be rejected. Most of the sentences from "Excentriske noveller" are bad, both in terms of "aa" and the capitalized nouns as well as some dated language. I have rejected many, while accepting a few here and there. Finding public domain Danish texts with modern language is a bit of a task. The laws of Denmark should be ok and I have added a few from https://www.retsinformation.dk/eli/lta/2019/936

fnielsen commented 3 years ago

I have now completed the sentences that were shown to me for review. I rejected most of the "Excentriske noveller" sentences. The other sentences were mostly ok.

fnielsen commented 3 years ago

I have now discovered https://github.com/common-voice/common-voice/blob/main/server/data/da/sentence-collector.txt and see some sentences are questionable:

MichaelKohler commented 3 years ago

A validation file can now have an arbitrary number of rules, and each of these rules can define its own error message. Therefore I'm going to close this, as it can be solved by adding a validation for Danish. More information can be found at https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/VALIDATION.md.
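To illustrate the idea of several rules with their own error messages, here is a language-agnostic sketch in Python. It does not reproduce the actual Sentence Collector rule format, which is described in VALIDATION.md; the `RULES` list, the patterns and the messages are all assumptions made up for this example.

```python
import re

# Hypothetical rules: (pattern, error message). Not the real Sentence Collector format.
RULES = [
    (
        re.compile(r"aa", re.IGNORECASE),
        "Sentence may use pre-reform spelling (aa instead of å)",
    ),
    (
        re.compile(r"\b(?:en|et) [A-ZÆØÅ][a-zæøå]"),
        "Word after en/et is capitalized; common nouns should be lowercase",
    ),
]


def validate(sentence: str) -> list[str]:
    """Return the error messages of all rules that match the sentence."""
    return [message for pattern, message in RULES if pattern.search(sentence)]


print(validate("Du havde ikke holdt ud at see paa det, moer"))
```

A blanket "aa" rule would also reject any word that legitimately keeps a double a, for example some names, so a real rule would probably need exceptions.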

fnielsen commented 2 years ago

Someone has added lots of sentences from www.andersstories.com/da/andersen_fortaellinger/list. I think they should all be erased. For instance, "Nu er han vel under jorden og du med ide". While it contains no spelling error that we can detect with a modern word-based spell checker, there is something odd about the sentence; e.g., "ide" should probably be "Ide", to indicate a person's proper name and not "idé". Another example: "Knap var hun paa ord, tvær af mine, men villig til sin dont". This is very antiquated language. In this case one could perhaps add a rule on "aa" (for the word "paa"), but in general there are lots of old-sounding sentences that degrade the Common Voice sentence collection for Danish.

fnielsen commented 2 years ago

In https://github.com/common-voice/common-voice/blob/main/server/data/da/sentence-collector.txt strange things like

Should one open a pull request on that file?

MichaelKohler commented 2 years ago

Hi, thanks for reporting this. That file gets automatically exported from the Sentence Collector. Any PR to that file would be overwritten again on the next export, so these sentences would need to be deleted in the Sentence Collector database.

If you identify any wrong sentence, you can look at its details by using this link and adjusting the sentence at the end: https://commonvoice.mozilla.org/sentence-collector/sentences/da?sentence=Andersen%20-%20stemmer%20solskinshistoriernu%20skal%20jeg%20fort%C3%A6lle

Or alternatively you can find all the sentences here: https://commonvoice.mozilla.org/sentence-collector/sentences/da

Every sentence has a "source" and a "batch" field. If you notice a pattern there, such as everything coming from the same source or even the same "batch", we can delete based on that. However for both cases we'd need to be a bit careful to not delete a lot of valid sentences with it as well.
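One way to look for such a pattern is to count, per source or per batch, how many of the sentences are ones you have flagged. The sketch below assumes the sentences are available as records with `sentence`, `source` and `batch` keys; that record shape and the function name are assumptions for illustration, not the actual database schema.

```python
from collections import Counter


def count_by_field(records, flagged, field):
    """Map each value of `field` to (flagged count, total count)."""
    totals = Counter(r[field] for r in records)
    bad = Counter(r[field] for r in records if r["sentence"] in flagged)
    return {value: (bad[value], totals[value]) for value in totals}


# Made-up example data.
records = [
    {"sentence": "a", "source": "X", "batch": "1"},
    {"sentence": "b", "source": "X", "batch": "1"},
    {"sentence": "c", "source": "Y", "batch": "2"},
]
print(count_by_field(records, {"a", "b"}, "source"))
```

A source or batch where nearly all sentences are flagged would be a candidate for deletion as a whole, while a mixed one would need the sentence-by-sentence route.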

We can also delete sentence by sentence if you can give me a text file with a sentence per line that should be deleted.

In the end, I'd like to have a Danish validation file to avoid these mistakes in the future. While deleting sentences is doable, it requires quite some effort to identify the sentences to be deleted. If we can come up with a validation file, we could tackle the issue at its core and not even allow sentences to be uploaded that would later need to be deleted.

jakobkappel commented 2 years ago

Someone has added lots of sentences from www.andersstories.com/da/andersen_fortaellinger/list. I think they should all be erased. For instance, "Nu er han vel under jorden og du med ide". While it contains no spelling error that we can detect with a modern word-based spell checker, there is something odd about the sentence; e.g., "ide" should probably be "Ide", to indicate a person's proper name and not "idé". Another example: "Knap var hun paa ord, tvær af mine, men villig til sin dont". This is very antiquated language. In this case one could perhaps add a rule on "aa" (for the word "paa"), but in general there are lots of old-sounding sentences that degrade the Common Voice sentence collection for Danish.

I agree.

All sentences from "https://www.andersenstories.com/da/andersen_fortaellinger/*" should be removed.

Very few of the 20,000+ sentences are correct sentences in modern Danish. This is not surprising since the texts were mostly written in the 1830s.

I don't see how a validation file can easily be produced that would identify even a minority of these sentences. The many incomplete sentences aside, the language used by H.C. Andersen is not valid Danish today for both obvious and subtle reasons, e.g.:

  1. Orthography. Nouns are no longer capitalized unless they are proper nouns or are placed after a full stop or (sometimes) after a colon; sometimes (but not always) "i" is used in nouns where "j" is used today; etc.
  2. Grammar. In modern Danish a verb in the present tense never ends in "e", but a verb in the infinitive does. Telling which is which requires an understanding of grammar and cannot be determined from lists of words.
  3. Order of words has changed in some cases.
  4. Vocabulary. Some words have fallen out of common usage or are not used in the same contexts anymore.

peterleth1 commented 2 years ago

I wrote a book (called Creative Commons for alle) some years ago. I was wondering if the text could be of any use for our collection of sentences. I think the text should be cleaned, so that numbers and English names (like Creative Commons) are changed to something more generic like "hest". But if I do clean the text, how do I then import 70 pages of text? With the sentence collector or by uploading a file to...? Thank you. I would like to see this project succeed!

fnielsen commented 2 years ago

I was wondering if the text could be of any use for our collection of sentences.

I think that would be fine! I wonder if English names are a problem.

MichaelKohler commented 2 years ago

But if I do clean the text, how do I then import 70 pages of text? With the sentence collector or by uploading a file to...? Thank you. I would like to see this project succeed!

Depends on the number of sentences. 70 pages of text might not qualify for the bulk upload, though I'm not sure how hard that limit is (I haven't recently checked text submitted through that, so it might be worth a try). You can find more info in the "Bulk submission" section at https://common-voice.github.io/community-playbook/sub_pages/text.html.

Strads commented 2 years ago

I agree with all points above - the HC Andersen texts are old and do not represent modern Danish. Furthermore, the import seems to have many errors.

I have cleaned up the sentence collector file before seeing that the sentence collector does not use pull requests.

Is there a way to edit the sentences in the database instead of deleting them? I can extract the lines I have identified that need to be edited or deleted, and we could delete all. I do not think it is a loss of quality for the database to delete them all instead of doing the edits below.

I have:

You can see the changes here: https://github.com/common-voice/common-voice/compare/main...Strads:common-voice:main

This still leaves a lot of "old Danish" sentences in the dataset, but at least we do not have to speak each sentence twice! (I am not sure I have ever recorded anything not from HC Andersen, which makes me think the Common Voice dataset will work fantastically for fairy tales.)

Attached is a file with sentences that are duplicates (some may not be valid due to being detected as duplicates after removing leading symbols), the "hans christian..." sentences, and the sentences not transcribed properly. The sentences with leading symbols are not included. sentences_to_remove_da_sentence-collector.txt

MichaelKohler commented 2 years ago

@jakobkappel would you also agree that the sentences in the sentences_to_remove_da_sentence-collector.txt file in the comment above should be deleted? I see @fnielsen has put a thumbs-up emoji on it. If everyone agrees I can take that list and remove those sentences.

fnielsen commented 2 years ago

I have sampled some sentences in sentences_to_remove_da_sentence-collector.txt. Some are ok/okish.

  1. 1 Abc-bogen eventyret eventyr af hans christian andersen h (wrong)
  2. 101 Snedronningen eventyret eventyr af hans christian andersen h (wrong)
  3. 501 Da trådte ind i stuen et barn, dronningens lille søn (strange old-fashioned word order)
  4. 1001 Den lå der en time, den lå der i to (ok)
  5. 1201 Der går den gamle vaskekone omme fra strædet (ok)
  6. 1501 Det er den, som har rørt sig i ham, nu er det overstået (okish)
  7. 2001 Det skulle ikke ligge der, men som det lå gav det ly (ok)
  8. 2501 Du havde ikke holdt ud at see paa det, moer (old spelling)
  9. 3001 Gud ved hvor rig han var endda (ok/okish)
  10. 4001 Hvor det er velbetænkt (ok)
  11. 5001 Jeg synes her kommer nogen lige bagefter (okish)

Strads commented 2 years ago

@fnielsen Thank you for testing the file. The file contains duplicates, i.e. the "ok" and "okish" lines from your sampling exist in the dataset with a "." after. I do not think there is much reason to include both

  1. Den lå der en time, den lå der i to
  2. Den lå der en time, den lå der i to.

in the dataset, and therefore the one without "." is included in the file for removal.
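For reference, a minimal sketch of how such near-duplicates could be detected, assuming duplicates are sentences that are identical after stripping leading symbols and a trailing full stop. This is an illustration with made-up function names, not the script used to produce the attached file.

```python
import re


def normalize(sentence: str) -> str:
    """Strip leading symbols and a trailing full stop; keep case as-is."""
    stripped = re.sub(r"^\W+", "", sentence.strip())
    return stripped.rstrip(".").rstrip()


def duplicates_to_remove(sentences):
    """Return the duplicate variants to drop, preferring to keep the one ending in '.'."""
    kept = {}     # normalized form -> variant currently kept
    removed = []
    for sentence in sentences:
        key = normalize(sentence)
        if key not in kept:
            kept[key] = sentence
        elif sentence.rstrip().endswith(".") and not kept[key].rstrip().endswith("."):
            # The new variant has the full stop: keep it, drop the earlier one.
            removed.append(kept[key])
            kept[key] = sentence
        else:
            removed.append(sentence)
    return removed


print(duplicates_to_remove([
    "Den lå der en time, den lå der i to",
    "Den lå der en time, den lå der i to.",
    "Hvor det er velbetænkt.",
]))
```

Case is deliberately left untouched, since capitalization can change the meaning of a Danish word.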

Strads commented 2 years ago

@MichaelKohler What more do we need to get this change committed? Both @fnielsen and @jakobkappel have given a thumbs-up in agreement, as far as I understand.

The discussion can continue on whether or not we should remove all fairy tale text from the corpus, but this is possible to do later as well.

peterleth1 commented 2 years ago

If I may add: I think we should remove old texts, fairy tales etc. as quickly as possible. They give a bad impression of the project to any new users.

MichaelKohler commented 2 years ago

Sorry for not getting back to you on this, I'm not getting an email when there are emojis added to comments. In any case, the sentences are now deleted from Sentence Collector.