HarikalarKutusu / cv-tbox-split-maker

Creates alternative splits for Mozilla Common Voice datasets for further analysis. Supports delta-version upgrades.
Mozilla Public License 2.0
commonvoice datasets metadata splitting voice-ai

Common Voice Toolbox - Split Maker

A collection of scripts to create alternative splits, to check important measures in multiple Common Voice releases, languages and alternate splitting strategies.

This tooling will be part of ToolBox, to be released separately. It will eventually be transformed into a more generalized script in the core.

In the current state the toolchain is:

cv-tbox-split-maker (create splits) => cv-tbox-dataset-compiler (compile detailed statistics) => cv-tbox-dataset-analyzer (web interface/visualization tool for statistics)

Note: This repository has been renamed from "Common Voice Diversity Check" into "Common Voice Toolbox - Split Maker"

Why?

With L languages, V versions and A splitting algorithms, this means repeatedly creating and analyzing L*V*A sets of splits.

This is where this tool comes in. You just put your data in directories and feed them to scripts.

Scripts

In the required execution order:

Extracting from downloaded datasets

python3 extract.py [--all] [--delta] [--force]

Expands files from downloaded .tar.gz dataset files residing in a directory, into another directory.

Options:

Merging previous versions with delta releases to get the newest release

python3 merge_delta.py

We have previous FULL dataset(s) (e.g. v18.0) and downloaded DELTA dataset(s) (v19.0 delta) - and extracted all of the .tsv (and maybe .mp3) files. This script combines them and creates the new FULL dataset (e.g. v19.0 FULL). But:

The script handles the following:

After running this script you should:

Importing extracted (and/or delta-merged) .tsv files into split-maker

python3 collect.py

From the expanded FULL dataset directory, copies metadata files to internal space to work on (these include common .tsv files and default split files).
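
The copy step described above can be pictured roughly as follows. This is only a minimal sketch, not the actual collect.py: the function name collect_tsv, its force parameter and the assumption that the expanded data is laid out as <cv-corpus>/<lc>/*.tsv are all illustrative.

```python
import shutil
from pathlib import Path


def collect_tsv(src_root: Path, dst_root: Path, force: bool = False) -> list[Path]:
    """Copy <src_root>/<cv-corpus>/<lc>/*.tsv into <dst_root>/<cv-corpus>/<lc>/.

    Hypothetical helper: only metadata (.tsv) files are copied, audio is ignored.
    """
    copied = []
    for tsv in sorted(src_root.glob("*/*/*.tsv")):
        lc_dir = tsv.parent  # e.g. .../cv-corpus-NN.N-YYYY-MM-DD/tr
        dst_dir = dst_root / lc_dir.parent.name / lc_dir.name
        dst = dst_dir / tsv.name
        if dst.exists() and not force:
            continue  # already collected, keep the existing copy
        dst_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(tsv, dst)
        copied.append(dst)
    return copied
```

Calling collect_tsv(Path("expanded"), Path("data/experiments/s1")) would then mirror only the metadata files; both paths here are placeholders.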

Splitting

NOTE: While running the algorithms, we drop any voices known to have deleted their recordings. We keep this data under the <repo_root>/data directory. We will keep analyzing the complete data and extend this list. Deleted users' recordings do not exist in CV releases published after their deletion request, but we should also honor their wishes when working with previous versions.
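
As an illustration of this filtering step, the sketch below removes every row whose client_id appears in a set of deleted voices. The client_id column comes from the Common Voice .tsv files; the helper names and the way the deleted IDs are supplied are assumptions, since the format of the list under <repo_root>/data is not described here.

```python
import csv
import io


def read_tsv(text):
    """Parse Common Voice style tab-separated text into a list of dict rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def drop_deleted_voices(rows, deleted_ids):
    """Return only the rows whose client_id is NOT in deleted_ids (hypothetical helper)."""
    deleted = set(deleted_ids)
    return [row for row in rows if row["client_id"] not in deleted]
```

The same filter would apply to any release, including older ones that still contain the deleted recordings.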

Discontinued script (the results can already be seen in the Dataset Analyzer)

How

To prepare:

The internal data directory structure is like:

data_root/experiments
    <exp>                           # e.g. s1, s99, v1
        <cv-corpus>                 # e.g. cv-corpus-NN.N-YYYY-MM-DD
            <lc>                    # Language directory (en, tr etc)
                *.tsv               # Metadata (*.tsv only)
            ...
        ...
    <exp>
        ...

Under experiments/s1, ALL .tsv files from the release can be found. Other algorithm directories only contain train/dev/test.tsv files.

A NOTE: We work with all versions and languages to analyze them, but you can work with a single language or a couple of languages. The scripts are data-driven: they process whatever you put into the source directories. So if you have been working with a single language and want to add another, no problem (unless forced to overwrite in config.py, the scripts skip already processed languages by checking directory existence).
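
The "skip what is already there" behaviour can be sketched like this. It is an assumption-laden illustration: the function name and the force flag are hypothetical stand-ins for whatever config.py actually exposes.

```python
from pathlib import Path


def languages_to_process(src_corpus: Path, dst_corpus: Path, force: bool = False) -> list[str]:
    """List language codes found in src_corpus that still need processing.

    A language is skipped when its directory already exists under dst_corpus,
    unless force is set (hypothetical equivalent of a forced overwrite).
    """
    todo = []
    for lc_dir in sorted(p for p in src_corpus.iterdir() if p.is_dir()):
        if force or not (dst_corpus / lc_dir.name).is_dir():
            todo.append(lc_dir.name)
    return todo
```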

Algorithms and the data

The data we use is huge and not suited for github. We used the following:

For vw and vx we limited the process to datasets with >=2k recordings in the validated bucket. Language codes for the different dataset flavors are listed in the languages.py file.
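
A threshold check of this kind could look like the sketch below, which simply counts the data rows of validated.tsv. The helper names and the MIN_VALIDATED constant are illustrative; the real scripts may obtain the count differently.

```python
from pathlib import Path

MIN_VALIDATED = 2000  # the ">=2k recordings" threshold, as an assumed constant


def count_recordings(tsv_path: Path) -> int:
    """Count data rows in a .tsv file, excluding the header line."""
    with tsv_path.open(encoding="utf-8") as f:
        next(f, None)  # skip the header
        return sum(1 for line in f if line.strip())


def is_eligible(lang_dir: Path, minimum: int = MIN_VALIDATED) -> bool:
    """True when <lang_dir>/validated.tsv exists and has at least `minimum` rows."""
    validated = lang_dir / "validated.tsv"
    return validated.is_file() and count_recordings(validated) >= minimum
```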

Compressed splits for each language / dataset version / algorithm can be found under the shared Google Drive location. To use them in your trainings, just download one and override the default train/dev/test.tsv files in your expanded dataset directory. Make sure you match the versions.

Other

License

AGPL v3.0

Some Performance Metrics

Here are some performance metrics I recorded on the following hardware after the release of Common Voice v16.1, when I re-implemented the multiprocessing and ran the whole set of algorithms (except s1, which I've taken from the releases) on all active CV releases (we leave out intermediate/corrected ones like v2, 5.0, 6.0, 16.0 etc).

Algo   Total DS   Processed DS   Total Dur   Avg Sec per DS
s99    1,222      1,207          05:40:02    16.904
v1     1,222      1,222          00:05:03    0.251
vw     1,222      633            00:03:32    0.336
vx     1,222      617            00:03:23    0.330

DS: Dataset. All algorithms ran on 12 parallel processes

TO-DO/Project plan, issues and feature requests

You can look at the results in the Common Voice Dataset Analyzer. This will eventually be part of the Common Voice Toolbox's Core, but it will be developed here...

The project status can be found on the project page. Please post issues and feature requests, or Pull Requests to enhance.



FUTURE REFERENCE

Although this will go into the core, we will also publish it separately. Below you'll find what it will become (tentative, might change):

tbox_split_maker

The script can use alternative splitting strategies for you to try on your language/languages. Then you should re-run "tbox_diversity_table" to analyze statistics in these new splits and compare with other strategies.

python3 tbox_split_maker <--split_strategy|--ss <strategy_code> [<parameter>] > [--exp <experiments_directory>] --in <path|experiment> --out <path|experiment> [--verbose]

Options:

--exp : If given, experiments are searched/created under this directory. If not given, experiments directory under cloned repo will be used.

--in <path|experiment> : If an existing full path is given, that directory is used to feed the default splits (e.g. --in c:\datasets\cv\v10.0\en). If only a string is given (e.g. --in releases), the given string is assumed to be under the experiments_directory and searched there.

--out <path|experiment> : If an existing full path is given (e.g. --out d:\trials\splits), it will be used for output of new splits. If only a string is given (e.g. --out releases), the given string is assumed to be under the experiments_directory and created there. In either case, the source is first copied to the destination and THEN the new splits override the existing ones. Other unrelated .tsv files are also copied for dataset completeness. If your splits seem OK, you can just copy-override the original dataset you downloaded and expanded with these files.

--verbose: Prints out more information, by default minimal information is displayed.
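
Since tbox_split_maker is still tentative, the following is only a sketch of how the --in/--out rules above might be resolved, assuming nothing beyond what the option descriptions say. The names resolve_location and copy_then_override are hypothetical.

```python
import shutil
from pathlib import Path


def resolve_location(value: str, experiments_dir: Path) -> Path:
    """Use an existing full path as-is, otherwise treat value as an experiment name."""
    candidate = Path(value).expanduser()
    if candidate.is_absolute() and candidate.is_dir():
        return candidate
    return experiments_dir / value


def copy_then_override(src: Path, dst: Path, new_splits: dict[str, str]) -> None:
    """Copy the source metadata to dst first, THEN overwrite the split files.

    new_splits maps a file name (e.g. "train.tsv") to its new text content.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)
    for name, content in new_splits.items():
        (dst / name).write_text(content, encoding="utf-8")
```

Copying first keeps the unrelated .tsv files in place, so the output directory stays a complete set of metadata.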

Currently supported strategies:

--split_strategy cc [N]

Run Common Voice's Corpora Creator with alternative recordings per sentence setting. By default, this is 1, meaning there is 1 recording per sentence in the final splits, even if different users might record sentences multiple times. Although this setting is meaningful to prevent sentence bias, it might be desirable to have sentences recorded by different voices/genders/ages/accents so that your model gets better on alternatives. Also, especially with low resource languages, the default setting drops the training split size to a small fraction of what's available.

For this to work, you need to clone and compile Mozilla Common Voice CorporaCreator repo as follows:

git clone https://github.com/common-voice/CorporaCreator.git
cd CorporaCreator; python3 setup.py install
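
To make the effect of the recordings-per-sentence setting concrete, here is a simplified illustration. It is not CorporaCreator's code, only a sketch of the idea of capping how many recordings of the same sentence are kept; the function name and the max_per_sentence parameter are made up for the example.

```python
from collections import defaultdict


def limit_recordings_per_sentence(rows, max_per_sentence=1):
    """Keep at most max_per_sentence recordings for each distinct sentence.

    rows is an iterable of dicts with a "sentence" key, as read from a
    Common Voice .tsv file. Earlier rows win over later ones.
    """
    seen = defaultdict(int)
    kept = []
    for row in rows:
        if seen[row["sentence"]] < max_per_sentence:
            seen[row["sentence"]] += 1
            kept.append(row)
    return kept
```

With the default of 1, a sentence recorded by five people contributes one recording; raising the limit lets more voices through for the same text.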

--split_strategy sentence

In this strategy, sentence unbiasing takes precedence, so no sentence is repeated across splits, but voices (people) can exist in more than one split. This strategy ensures that the whole validated set is used. You might like to experiment with the percentages, though. Usually 80-10-10 or 70-15-15 are considered good values (make sure they add up to 100).

Ex:

python tbox_split_maker --exp ~/cv/experiments --in default --out test-70-15-15 --ss sentence 70 15 15
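
One possible way to realize such a split is sketched below. It is an assumption, not the planned implementation: the percentages are applied to the unique sentences (so the recording counts per split only approximate them), and the function name and seed parameter are illustrative.

```python
import random


def split_by_sentence(rows, train_pct, dev_pct, test_pct, seed=42):
    """Assign every recording to a split based on its sentence only.

    All recordings of one sentence land in the same split, so no sentence is
    shared between splits, while a voice may appear in several of them.
    """
    if train_pct + dev_pct + test_pct != 100:
        raise ValueError("percentages must add up to 100")
    sentences = sorted({row["sentence"] for row in rows})
    random.Random(seed).shuffle(sentences)
    n_train = len(sentences) * train_pct // 100
    n_dev = len(sentences) * dev_pct // 100
    bucket = {}
    for i, sentence in enumerate(sentences):
        if i < n_train:
            bucket[sentence] = "train"
        elif i < n_train + n_dev:
            bucket[sentence] = "dev"
        else:
            bucket[sentence] = "test"  # test takes the remainder
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        splits[bucket[row["sentence"]]].append(row)
    return splits
```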

--split_strategy sentence-voice

This algorithm is similar to the "sentence" strategy, but it also ensures that no voice appears in more than one split. Therefore the dataset will not be fully used. The test and dev splits will be as desired, but the train split will be smaller, so the percentages will not add up to 100. This strategy prevents both sentence and voice bias and usually uses most of the validated set. The unused amount depends entirely on the dataset: how large the text corpus is, how many voice contributors and repeated recordings it has, and whether people are recording too few and/or too many sentences. Therefore it is good practice to analyze the generated split with the tbox_directory_table script...

Ex:

python tbox_split_maker --exp ~/cv/experiments --in default --out test-70-15-15 --ss sentence-voice 70 15 15 # note that these numbers are target, result will be different
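
The sketch below shows one simple greedy way to get splits that share neither voices nor sentences. It is a simplified assumption, not the planned algorithm: voices are shuffled and assigned whole to test, then dev, then train, and any recording whose sentence was already claimed by another split is dropped. Split sizes therefore only approximate the targets.

```python
import random
from collections import defaultdict


def split_by_sentence_and_voice(rows, train_pct, dev_pct, test_pct, seed=42):
    """Greedy split where each voice and each sentence belongs to one split only."""
    by_voice = defaultdict(list)
    for row in rows:
        by_voice[row["client_id"]].append(row)
    voices = sorted(by_voice)
    random.Random(seed).shuffle(voices)

    total = len(rows)
    targets = {
        "test": total * test_pct / 100,
        "dev": total * dev_pct / 100,
        "train": total * train_pct / 100,
    }
    splits = {"test": [], "dev": [], "train": []}
    owner = {}  # sentence -> name of the split that claimed it first
    for voice in voices:
        # First split (test, dev, train order) that is still below its target.
        name = next((s for s in splits if len(splits[s]) < targets[s]), None)
        if name is None:
            break  # all targets reached, remaining voices stay unused
        for row in by_voice[voice]:
            if owner.setdefault(row["sentence"], name) == name:
                splits[name].append(row)
            # otherwise the recording is dropped to avoid sentence overlap
    return splits
```

Comparing the total number of returned rows with len(rows) gives a quick idea of how much of the dataset went unused.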

--split_strategy random

This is a dummy algorithm that does not take any bias into account. It splits the whole dataset randomly and fully. The resulting model performance in terms of bias might also be random. Here, the split percentages do not need to add up to 100. This is provided for experimenting, by slicing different smaller sizes.

Ex:

python tbox_split_maker --exp ~/cv/experiments --in default --out random-50-10-10 --ss random 50 10 10
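
A random split with percentages that may sum to less than 100 can be sketched as below. The function name and the seed parameter are illustrative assumptions; whatever is left after the three slices is simply not used.

```python
import random


def split_randomly(rows, train_pct, dev_pct, test_pct, seed=42):
    """Shuffle all recordings and slice them by percentage, ignoring any bias."""
    if train_pct + dev_pct + test_pct > 100:
        raise ValueError("percentages must not exceed 100 in total")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * train_pct // 100
    n_dev = n * dev_pct // 100
    n_test = n * test_pct // 100
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:n_train + n_dev + n_test],
    }
```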

Note: We will add more options to limit data, e.g. taking max N recordings from a single voice, taking demographic data into account, such as equal gender, etc.