A web app for viewing detailed statistics of Common Voice datasets and their text corpora.
This tool was created for language communities on Common Voice and for those who train models using these datasets. Its main purpose is to show the general and detailed statistical characteristics of the datasets, with special emphasis on their health and diversity, so that communities can direct their efforts toward correcting problem areas, particularly voice, gender, and transcript bias. The data presented here are the results of long offline calculations done by the Common Voice Dataset Compiler. Currently it covers the most important measures, and new ones will be added over time.
The user interface currently supports the following languages: Catalan, English (default), and Turkish. Please post a PR to add your language.
Because the data is rather large, it is divided into parts, and only the relevant part is loaded when you click a language-version pair. The table below lists all languages Common Voice supports, along with the dataset versions this application covers. To shorten the list, use the filter feature at the bottom to select one or more locales you are interested in.
If you are just interested in the general status between versions, you can use the sister app Common Voice Metadata Viewer (Beta Mirror).
In this application, under each dataset, you may see results for multiple splitting algorithms, namely s1, s99, and v1 for now (some older datasets may lack some of them). These splitting algorithms are:
A working (beta) version is here for your use: Common Voice Dataset Analyzer (Beta Mirror). After each CV release, we first release to the beta site and try to fix any inconsistencies. Once they are fixed, both versions are updated to a stable version.
Clone the repository and change into the web directory (cd web)
Run npm install to get the dependencies
Run npm start to run the app locally
git clone https://github.com/HarikalarKutusu/cv-tbox-dataset-analyzer.git
cd cv-tbox-dataset-analyzer
cd web
npm install
npm start
The whole list is available under the project on GitHub. Please open issues, submit feature requests, or make pull requests.