A collection of scripts to create alternative splits, to check important measures in multiple Common Voice releases, languages and alternate splitting strategies.
This tooling will be part of ToolBox, to be released separately. It will eventually be transformed into a more generalized script in the core.
In the current state the toolchain is:
cv-tbox-split-maker (create splits) => cv-tbox-dataset-compiler (compile detailed statistics) => cv-tbox-dataset-analyzer (web interface/visualization tool for statistics)
Note: This repository has been renamed from "Common Voice Diversity Check" into "Common Voice Toolbox - Split Maker"
Doing this with L languages, V versions and A splitting algorithms means repeatedly creating and analyzing L*V*A splits.
This is where this tool comes in. You just put your data in directories and feed them to scripts.
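To make the L*V*A count concrete, here is a minimal sketch. The language codes, versions and algorithm names are placeholder values picked for illustration; they are not read from the repository's configuration.

```python
from itertools import product

# Placeholder values for illustration only
languages = ["en", "tr"]
versions = ["18.0", "19.0"]
algorithms = ["s1", "s99", "v1"]

# Every (language, version, algorithm) combination needs its own split run
jobs = list(product(languages, versions, algorithms))
print(len(jobs))  # L * V * A = 2 * 2 * 3
```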
In the required execution order:
python3 extract.py [--all] [--delta] [--force]
Expands files from downloaded .tar.gz dataset files residing in a directory, into another directory.
Options:
--all: If specified, clips are also expanded. If not specified, only the .tsv files are extracted.
--force: If specified, existing languages/files get overwritten.
python3 merge_delta.py
We have previous FULL dataset(s) (e.g. v18.0) and downloaded DELTA dataset(s) (v19.0 delta) - and extracted all of the .tsv (and maybe .mp3) files. This script combines them and creates the new FULL dataset (e.g. v19.0 FULL). But:
It does not copy the .mp3 files, to prevent duplication. If you want, you can merge the clips directories yourself (e.g. by moving v18.0/clips/*.mp3 and v19.0-delta/clips/*.mp3 into the new v19.0/clips).
It does not create the default splits (the train.tsv, dev.tsv, and test.tsv files). That is the job of the s1 algorithm, as explained below. After creating the default splits, you can manually copy them into the created full version directory (e.g. v19.0 in the above example).
The script handles the following:
Combines the validated.tsv, invalidated.tsv, reported.tsv and clip_durations.tsv files and writes them into the new full dataset directory.
other.tsv: Some of the records in the old other.tsv might have moved to validated.tsv or invalidated.tsv, and new ones might have been added. We drop the moved ones and add the new ones from the new DELTA.
[...] *_sentences.tsv files, but this might change in future CV releases)
[...] reported.tsv, which can include multiple reports of the same sentence with the same reason)
After running this script you should:
Run collect.py (see below).
Run the s1 algorithm to create the default splits for the new FULL dataset (i.e. train.tsv, dev.tsv, and test.tsv).
python3 collect.py
From the expanded FULL dataset directory, copies metadata files to internal space to work on (these include common .tsv files and default split files).
NOTE: While running the algorithms, we drop the recordings of any voices known to have deleted their recordings. We keep this data under the <repo_root>/data directory. We will keep analyzing the complete data and extending this list. Deleted users' recordings do not exist in CV releases published after their deletion request, but we should also honor their wishes when working with previous versions.
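As an illustration of the note above, dropping the records of deleted voices from a metadata file could look like the sketch below. The function name and the way the deleted IDs are passed in (a plain set of client_id values) are assumptions; the actual format of the list kept under <repo_root>/data is not described here.

```python
from pathlib import Path


def drop_deleted_voices(src: Path, dst: Path, deleted_ids: set[str]) -> int:
    """Copy a Common Voice .tsv file, skipping rows of deleted voices.

    Hypothetical helper: deleted_ids is assumed to hold client_id values.
    Returns the number of dropped rows.
    """
    dropped = 0
    # newline="" keeps the original line endings untouched
    with src.open(encoding="utf-8", newline="") as fin, \
            dst.open("w", encoding="utf-8", newline="") as fout:
        header = fin.readline()
        fout.write(header)
        id_col = header.rstrip("\r\n").split("\t").index("client_id")
        for line in fin:
            if line.rstrip("\r\n").split("\t")[id_col] in deleted_ids:
                dropped += 1
                continue
            fout.write(line)
    return dropped
```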
To prepare:
Install Python v3.12.x+, create a venv and activate it (prefer the latest version; we work with the latest and do not check compatibility with older versions).
Run pip install -U -r requirements.txt to install the dependencies.
Put the downloaded full dataset files into a directory (e.g. /downloaded_datasets/cv-corpus-18.0-2024-06-14/*.tar.gz).
Put the downloaded delta dataset files into a directory (e.g. /downloaded_datasets/cv-corpus-19.0-delta-2024-09-13/*.tar.gz).
Create a directory for the expanded datasets (e.g. /datasets, which would include subdirs like cv-corpus-18.0-2024-06-14/<lc> when expanded).
Edit conf.py to point to these directories and specify the versions you want to work with.
Run python extract.py to extract only the .tsv files from the datasets (see above for options).
Run python merge_delta.py to merge previous full versions with new delta versions (do not forget to run the s1 algorithm in this case).
Run python collect.py to copy the metadata files into the internal working area.
The internal data directory structure is like:
data_root/experiments
    <exp>                 # e.g. s1, s99, v1
        <cv-corpus>       # e.g. cv-corpus-NN.N-YYYY-MM-DD
            <lc>          # Language directory (en, tr etc)
                *.tsv     # Metadata (*.tsv only)
            ...
        ...
    <exp>
        ...
Under experiments/s1, ALL .tsv files from the release can be found. Other algorithm directories only contain the train/dev/test.tsv files.
A NOTE: We work with all versions and languages to analyze them, but you can work with a single language or a couple of languages. The scripts are data-driven: they process whatever you put into the source directories. So you might be working with a single language and later want to add another; no problem (unless forced to overwrite from config.py, the scripts skip already processed languages by checking directory existence).
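The "skip what is already processed" behavior can be sketched as below. The function name, its arguments and the force flag are hypothetical stand-ins; the real scripts read their settings from the configuration file.

```python
from pathlib import Path


def languages_to_process(src_corpus: Path, dst_corpus: Path, force: bool = False) -> list[str]:
    """Return language codes under src_corpus that still need processing.

    A language counts as processed when its directory already exists
    under dst_corpus; force=True ignores that check.
    """
    todo = []
    for lang_dir in sorted(p for p in src_corpus.iterdir() if p.is_dir()):
        already_done = (dst_corpus / lang_dir.name).is_dir()
        if force or not already_done:
            todo.append(lang_dir.name)
    return todo
```

Checking only for directory existence is cheap, but it also means a partially written language directory would be skipped unless the overwrite option is used.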
The data we use is huge and not suited for github. We used the following:
For vw and vx we limited the process to only include datasets with >=2k recordings in the validated bucket. Language codes for different dataset flavors are listed in the languages.py file.
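A minimal sketch of such a threshold check is shown below. It assumes one record per line in validated.tsv plus a header line; the helper names are hypothetical and this is not how the repository itself selects languages (it uses the lists in languages.py).

```python
from pathlib import Path

MIN_VALIDATED = 2000  # the ">=2k recordings" threshold mentioned above


def count_records(tsv_path: Path) -> int:
    """Count data rows in a .tsv file, excluding the header line."""
    with tsv_path.open(encoding="utf-8") as f:
        return max(sum(1 for _ in f) - 1, 0)


def is_eligible(lang_dir: Path, min_validated: int = MIN_VALIDATED) -> bool:
    """True if the language directory has enough validated recordings."""
    validated = lang_dir / "validated.tsv"
    return validated.is_file() and count_records(validated) >= min_validated
```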
Compressed splits for each language / dataset version / algorithm can be found under the shared Google Drive location. To use them in your training, just download one and overwrite the default train/dev/test.tsv files in your expanded dataset directory. Make sure you match the versions.
AGPL v3.0
Here are some performance metrics I recorded on the following hardware after the release of Common Voice v16.1, where I re-implemented the multiprocessing and ran the whole set of algorithms (except s1, which I've taken from the releases) on all active CV releases (we leave out intermediate/corrected ones like v2, 5.0, 6.0, 16.0 etc).
Algo | Total DS | Processed DS | Total Dur | Avg Sec per DS |
---|---|---|---|---|
s99 | 1,222 | 1,207 | 05:40:02 | 16.904 |
v1 | 1,222 | 1,222 | 00:05:03 | 0.251 |
vw | 1,222 | 633 | 00:03:32 | 0.336 |
vx | 1,222 | 617 | 00:03:23 | 0.330 |
DS: Dataset. All algorithms ran on 12 parallel processes
You can look at the results in the Common Voice Dataset Analyzer. This will eventually be part of the Common Voice Toolbox's Core, but it will be developed here.
The project status can be found on the project page. Please post issues and feature requests, or Pull Requests to enhance.
Although this will go into the core, we will also publish it separately. Below you'll find what it will become (tentative, might change):
The script can use alternative splitting strategies for you to try on your language/languages. Then you should re-run "tbox_diversity_table" to analyze statistics in these new splits and compare with other strategies.
python3 tbox_split_maker <--split_strategy|--ss <strategy_code> [<parameter>] > [--exp <experiments_directory>] --in <path|experiment> --out <path|experiment> [--verbose]
Options:
--exp
--in <path|experiment> : If an existing full path is given, that directory is used to feed the default splits (e.g. --in c:\datasets\cv\v10.0\en). If only a string is given (e.g. --in releases), the given string is assumed to be under the experiments_directory and is searched there.
--out <path|experiment> : If an existing full path is given (e.g. --out d:\trials\splits), it will be used for the output of the new splits. If only a string is given (e.g. --out releases), the given string is assumed to be under the experiments_directory and is created there. In either case, the source is first copied to the destination and THEN the new splits override the existing ones. Other unrelated .tsv files are also copied for dataset completeness. If your splits seem OK, you can just copy these files over the original dataset you downloaded and expanded.
--verbose: Prints out more information, by default minimal information is displayed.
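The way --in and --out values are interpreted can be sketched as follows. This is an assumption-laden illustration, not the planned implementation: the function name is hypothetical, any absolute path is treated as a "full path", and checking that it exists is left to the caller.

```python
from pathlib import Path


def resolve_location(value: str, experiments_dir: Path) -> Path:
    """Map an --in/--out value to a directory path.

    Assumption: absolute paths are used as given, anything else is
    treated as an experiment name under experiments_dir.
    """
    candidate = Path(value).expanduser()
    if candidate.is_absolute():
        return candidate
    return experiments_dir / candidate
```

With this sketch, a value such as releases resolves to a directory under the experiments directory, while a full path is returned unchanged.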
Currently supported strategies:
--split_strategy cc [N]
Run Common Voice's Corpora Creator with alternative recordings per sentence setting. By default, this is 1, meaning there is 1 recording per sentence in the final splits, even if different users might record sentences multiple times. Although this setting is meaningful to prevent sentence bias, it might be desirable to have sentences recorded by different voices/genders/ages/accents so that your model gets better on alternatives. Also, especially with low resource languages, the default setting drops the training split size to a small fraction of what's available.
For this to work, you need to clone and compile Mozilla Common Voice CorporaCreator repo as follows:
git clone https://github.com/common-voice/CorporaCreator.git
cd CorporaCreator; python3 setup.py install
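The effect of the recordings-per-sentence setting described above can be illustrated with a small sketch. This is not CorporaCreator's code; the function name is hypothetical and the records are assumed to be dictionaries with a sentence key, as in the Common Voice .tsv columns.

```python
from collections import defaultdict


def cap_recordings_per_sentence(records, max_per_sentence=1):
    """Keep at most max_per_sentence recordings of each sentence, in input order."""
    seen = defaultdict(int)
    kept = []
    for rec in records:
        if seen[rec["sentence"]] < max_per_sentence:
            seen[rec["sentence"]] += 1
            kept.append(rec)
    return kept
```

With the default of 1, every repeated recording of a sentence is discarded, which is why low resource languages can lose a large share of their data.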
--split_strategy sentence
In this strategy, sentence unbiasing has precedence, so no sentence is repeated in other splits, but voices (people) can exist in more than one split. This strategy ensures that the whole validated set is used. You might like to experiment with the percentages, though. Usually 80-10-10 or 70-15-15 are considered good values (make sure they add up to 100).
Ex:
python tbox_split_maker --exp ~/cv/experiments --in default --out test-70-15-15 --ss sentence 70 15 15
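One possible way to implement the idea behind the "sentence" strategy is sketched below. It is an illustration under stated assumptions, not the tool's algorithm: records are dictionaries with a sentence key, the function name and seed parameter are hypothetical, and the percentages are only met approximately because whole sentence groups are assigned at once.

```python
import random
from collections import defaultdict


def split_by_sentence(records, percentages=(80, 10, 10), seed=42):
    """Split records into train/dev/test so that no sentence spans two splits."""
    if sum(percentages) != 100:
        raise ValueError("percentages must add up to 100")

    # Group all recordings of the same sentence together
    by_sentence = defaultdict(list)
    for rec in records:
        by_sentence[rec["sentence"]].append(rec)

    sentences = sorted(by_sentence)
    random.Random(seed).shuffle(sentences)

    total = len(records)
    train_target = total * percentages[0] / 100
    dev_target = total * percentages[1] / 100

    splits = {"train": [], "dev": [], "test": []}
    for sentence in sentences:
        # Fill train first, then dev; everything left over goes to test
        if len(splits["train"]) < train_target:
            name = "train"
        elif len(splits["dev"]) < dev_target:
            name = "dev"
        else:
            name = "test"
        splits[name].extend(by_sentence[sentence])
    return splits
```

Because every sentence group lands in exactly one split, all records are used, which matches the property described for this strategy.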
--split_strategy sentence-voice
This algorithm is similar to the "sentence" strategy, but it also ensures that no voice exists in more than one split. Therefore the dataset will not be fully used. The test and dev splits will be as desired, but the train split will be smaller, so the percentages will not add up to 100. This strategy prevents both sentence and voice bias and usually uses most of the validated set. The unused amount totally depends on the dataset: the size of its text corpus, the number of voice contributors, how many repeated recordings it has, and whether people recorded too few and/or too many sentences. Therefore it is good practice to analyze the generated splits with the tbox_directory_table script.
Ex:
python tbox_split_maker --exp ~/cv/experiments --in default --out test-70-15-15 --ss sentence-voice 70 15 15 # note that these numbers are target, result will be different
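The constraint behind "sentence-voice" (no voice and no sentence shared between splits) can be illustrated with a simple greedy sketch. This is only one possible approach and is not claimed to be the tool's algorithm: the function name is hypothetical, records are assumed to carry client_id and sentence keys, test and dev are filled first, and any record that conflicts or does not fit is left unused.

```python
def split_by_sentence_and_voice(records, percentages=(80, 10, 10)):
    """Greedy split where each voice and each sentence lives in one split only."""
    total = len(records)
    targets = {
        "train": total * percentages[0] / 100,
        "dev": total * percentages[1] / 100,
        "test": total * percentages[2] / 100,
    }
    fill_order = ("test", "dev", "train")  # assumption: fill the small splits first
    splits = {"train": [], "dev": [], "test": []}
    voice_home, sentence_home = {}, {}

    for rec in records:
        v_split = voice_home.get(rec["client_id"])
        s_split = sentence_home.get(rec["sentence"])
        if v_split and s_split and v_split != s_split:
            continue  # voice and sentence belong to different splits: unused
        name = v_split or s_split
        if name is None:
            # New voice and new sentence: first split that still has room
            name = next((n for n in fill_order if len(splits[n]) < targets[n]), None)
        if name is None or len(splits[name]) >= targets[name]:
            continue  # no room left: unused
        splits[name].append(rec)
        voice_home[rec["client_id"]] = name
        sentence_home[rec["sentence"]] = name
    return splits
```

The dropped records are the price of removing both kinds of overlap, so the resulting split sizes should be checked against the targets.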
--split_strategy random
This is a dummy algorithm that does not care about any bias. It splits the whole dataset randomly and fully. The resulting model performance in terms of bias might also be random. Here, the split percentages also do not need to add up to 100. This is provided for experimenting with smaller slices of different sizes.
Ex:
python tbox_split_maker --exp ~/cv/experiments --in default --out random-50-10-10 --ss random 50 10 10
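A random split with percentages that may add up to less than 100 can be sketched like this. The function name and the seed parameter are hypothetical, and the sketch makes no claim about how the tool itself will shuffle or round.

```python
import random


def split_randomly(records, percentages=(80, 10, 10), seed=42):
    """Shuffle records and slice train/dev/test by percentage, ignoring any bias."""
    if sum(percentages) > 100:
        raise ValueError("percentages must not exceed 100 in total")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    total = len(shuffled)
    n_train = total * percentages[0] // 100
    n_dev = total * percentages[1] // 100
    n_test = total * percentages[2] // 100
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:n_train + n_dev + n_test],
    }
```

With 50 10 10, for example, 30 percent of the records are simply left out, which gives a smaller dataset to experiment with.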
Note: We will add more options to limit data, e.g. taking max N recordings from a single voice, taking demographic data into account, such as equal gender, etc.