HarikalarKutusu / cv-tbox-dataset-analyzer

Analysis and Viewer for Mozilla Common Voice Datasets
GNU Affero General Public License v3.0

[FR] About status of text-corpora analysis. #57

Open HarikalarKutusu opened 1 year ago

HarikalarKutusu commented 1 year ago

Mozilla Common Voice has started using its database directly for the new text corpus, without exporting newly added (validated) sentences to the public. As a result, our text-corpora analysis is outdated: it has not changed since the v13.0 release in March 2023.

You can read about the issue and possible solutions on the Common Voice repo:

https://github.com/common-voice/common-voice/issues/4100

It seems that until this is fixed, there is nothing we can do about it. Any other ideas are most welcome.

HarikalarKutusu commented 3 months ago

With v17.0, the text corpora have been released. Although they have problems, we could mitigate most of them in the Dataset Compiler.