common-voice / sentence-collector

Tool to collect and review sentences for Common Voice
https://commonvoice.mozilla.org/sentence-collector/
Mozilla Public License 2.0

Question: Process for already exported sentences? #632

Closed HarikalarKutusu closed 1 year ago

HarikalarKutusu commented 2 years ago

Despite all efforts, some sentences containing typos have already been exported. In addition, there are some bad sentences from the unmoderated period (mostly abbreviations and many foreign words; most of these have been reported and are in reported.tsv).

These result in inconsistent recordings. Unknown or foreign words get mispronounced, and a typo may be read as written or read in its corrected form...

Is there a process to correct these?

We are working on dataset health issues and are correcting them in a post-processing phase. But if these sentences are not corrected, they might cause additional bad recordings in the future.

MichaelKohler commented 2 years ago

@Heyhillary could you bring this up with the team? I think we need a general process or documentation for these cases and not just something from the sentence collector side. What do you think? Thanks!

drzraf commented 1 year ago

(Coming from https://github.com/common-voice/common-voice/pull/3786)

I think that sentences causing problems:

  1. Should be removed from the corpus
  2. Should be removed from the list of recordings
  3. Should lead to an adequate filter being put in place in the cleanup/validation step, to catch this particular class of problems in any future sentence import
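As an illustration of the third point, here is a minimal sketch of such a filter. It is not the sentence collector's actual validation: the function name `find_problems`, the two patterns, and the `blocklist` parameter are all hypothetical, and the abbreviation pattern only covers ASCII letters, so a real rule set would need per-language tuning.

```python
import re

# Hypothetical rules for illustration only (ASCII letters only).
# Matches all-caps tokens such as "NATO" and dotted forms such as "e.g."
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b|\b(?:[A-Za-z]\.){2,}")
DIGITS = re.compile(r"\d")


def find_problems(sentence, blocklist=frozenset()):
    """Return a list of reasons why a sentence should be rejected.

    `blocklist` is an assumed set of lowercase words (for example,
    foreign proper names) that a language community wants to exclude.
    """
    problems = []
    if ABBREVIATION.search(sentence):
        problems.append("abbreviation")
    if DIGITS.search(sentence):
        problems.append("digits")
    # Extract runs of letters (Unicode-aware) and compare in lowercase.
    words = {w.lower() for w in re.findall(r"[^\W\d_]+", sentence)}
    if words & blocklist:
        problems.append("blocked word")
    return problems
```

A sentence would pass the import only when `find_problems` returns an empty list; the same function could also be run over already exported sentences to produce a list of removal candidates.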

Rationale: It's of utmost importance to keep the corpus clean in order to:

HarikalarKutusu commented 1 year ago

"it's likely that problematic sentences have been spoken only once or twice, making possible errors more impacting."

Unfortunately, it is more than that for Turkish... For the first 4 years there was no moderation, and the initial text corpus was SETimes, which consists of Balkan news and contains many proper names from the Balkans. People recorded these sentences 3-4 times. This was how I started this journey :/

I'm currently writing an open-source middleware to exclude those sentences, as well as some bad voices (which require lengthy moderation by multiple people, via other software), before the data is fed to training. We will see in a month or so...
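The exclusion step could look roughly like the sketch below. This is not the middleware itself: the function names are hypothetical, and it assumes that both the reported file and the clips file are tab-separated with a header row and a `sentence` column, which should be checked against the actual dataset release.

```python
import csv
import io


def load_reported(reported_tsv, column="sentence"):
    """Collect the reported sentences from an open TSV file object."""
    reader = csv.DictReader(reported_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[column].strip() for row in reader}


def exclude_reported(clips_tsv, reported, column="sentence"):
    """Return the clip rows whose sentence is not in the reported set."""
    reader = csv.DictReader(clips_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row[column].strip() not in reported]


# Tiny in-memory example; real use would pass files opened with open().
reported = load_reported(io.StringIO("sentence\treason\nBad one\ttypo\n"))
clips = io.StringIO("path\tsentence\na.mp3\tBad one\nb.mp3\tGood one\n")
kept = exclude_reported(clips, reported)
```

Matching on the exact sentence text is the simplest choice; matching on a sentence identifier would be more robust if the files provide one.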

MichaelKohler commented 1 year ago

@HarikalarKutusu I'm archiving this project. Instead of just moving this issue over to the main CV repo, I would suggest creating a new, more generic issue if it is still relevant.