common-voice / sentence-collector

Tool to collect and review sentences for Common Voice
https://commonvoice.mozilla.org/sentence-collector/
Mozilla Public License 2.0

Italian has Chinese in it #638

Closed coreymillerrev closed 1 year ago

coreymillerrev commented 1 year ago

cv-corpus-10.0-2022-07-04/it/clips/common_voice_it_23771747.mp3 has this corresponding text: oggi in cinese si utilizzano indifferentemente gli ideogrammi 張三丰 oppure 張三峰

I see the following problems:

MichaelKohler commented 1 year ago

Thanks for filing this issue.

> sentence collector should be programmed to only use sentences with the admitted characters.

This can be done by improving https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/it.js.
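
For illustration, here is a minimal sketch of what such a rule could look like. It is not the contents of the linked file: the names `INVALIDATIONS` and `validateSentence`, the rule shape and the error message are all assumptions made for this example. The idea is to reject any sentence that contains a letter outside the Latin script, using Unicode property escapes in a regular expression.

```javascript
// Hypothetical rule list for Italian; the real it.js may be structured differently.
const INVALIDATIONS = [
  {
    // Matches any character that is a letter (\p{L}) but not in the Latin script,
    // for example Han ideographs such as 張. Digits, spaces and punctuation are
    // not letters, so this rule ignores them.
    regex: /[^\p{Script=Latin}\P{L}]/u,
    error: 'Sentence should only contain Latin-script letters',
  },
];

// Hypothetical helper: returns the error messages of every rule the sentence violates.
function validateSentence(sentence) {
  return INVALIDATIONS
    .filter((rule) => rule.regex.test(sentence))
    .map((rule) => rule.error);
}

// The sentence from this issue is flagged because of the Han characters.
console.log(validateSentence(
  'oggi in cinese si utilizzano indifferentemente gli ideogrammi 張三丰 oppure 張三峰'
));

// An ordinary Italian sentence, including accented letters, passes with no errors.
console.log(validateSentence('Oggi è una bella giornata.'));
```

A script-based check like this is broader than an explicit allowlist of Italian letters: it would still accept Latin-script letters that do not occur in Italian, so a stricter character class may be preferable in practice.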

> sentences should be revalidated automatically later in the process (if they can change)

They can't.