HarikalarKutusu / cv-tbox-split-maker

Checks diversity in Mozilla Common Voice default or alternative splits for multiple versions and languages
Mozilla Public License 2.0

Common Voice Diversity Check

A collection of scripts to create alternative splits and to check important measures across multiple Common Voice releases, languages and alternative splitting strategies.

This tooling will be part of Common Voice ToolBox, released separately. It will eventually be transformed into a more generalized script in the core.

Why?

The process of doing these with L languages, V versions and S splitting strategies (plus three splitting algorithms in each) means repeated processing of 3 * L * V * S splits.

This is where this tool comes in. You just put your data in directories and feed them to scripts. The basic measures are compiled into a tsv file which you can process with your own scripts or other tools. We also included an MS Excel file as an analysis frontend to this data.

Scripts

In the required execution order:

How

To prepare:

The internal data directory structure is like:

clone_root/experiments
    <exp>                           # e.g. s1, s99, v1
        <cv-corpus>                 # e.g. cv-corpus-NN.N-YYYY-MM-DD
            <lc>                    # Language directory (en, tr etc)
                *.tsv               # Metadata (*.tsv only)
            ...
        ...
    <exp>
        ...

Under experiments/s1, all .tsv files from the release can be found. Other algorithm directories only contain train/dev/test.tsv files.
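The layout above can be walked with a few lines of Python. This is only a minimal sketch, not part of the repository's scripts: the function name `list_split_files` is hypothetical, and it assumes exactly the `<exp>/<cv-corpus>/<lc>/*.tsv` nesting shown in the tree.

```python
from pathlib import Path


def list_split_files(experiments_dir):
    """Yield (experiment, corpus, language, tsv_path) for each .tsv file found.

    Hypothetical helper; assumes the <exp>/<cv-corpus>/<lc>/*.tsv layout.
    """
    root = Path(experiments_dir)
    # Three directory levels (exp, corpus, language), then the metadata files.
    for tsv_path in sorted(root.glob("*/*/*/*.tsv")):
        lc_dir = tsv_path.parent
        corpus_dir = lc_dir.parent
        exp_dir = corpus_dir.parent
        yield exp_dir.name, corpus_dir.name, lc_dir.name, tsv_path


if __name__ == "__main__":
    for exp, corpus, lc, path in list_split_files("experiments"):
        print(exp, corpus, lc, path.name)
```

Running it from the clone root prints one line per metadata file, which is a quick way to confirm that a new experiment directory was populated as expected.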

Algorithms and the data

The data we use is huge and not suited for github. We used the following:

For vw and vx we limited the process to only include datasets with >=2k recordings in the validated bucket.

Compressed splits for each language / dataset version / algorithm can be found under the shared Google Drive location. To use them in your training, just download one and override the default train/dev/test.tsv files in your expanded dataset directory.
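The override step can also be scripted. The sketch below assumes the downloaded archive has already been uncompressed into a directory that holds train.tsv, dev.tsv and test.tsv; the function name `override_splits` and the `.orig` backup suffix are hypothetical choices, not something the repository provides.

```python
import shutil
from pathlib import Path

SPLIT_FILES = ("train.tsv", "dev.tsv", "test.tsv")


def override_splits(new_splits_dir, dataset_lang_dir, backup_suffix=".orig"):
    """Copy train/dev/test.tsv over the defaults, backing up each original once.

    Hypothetical helper; both arguments are directories that already exist.
    """
    src = Path(new_splits_dir)
    dst = Path(dataset_lang_dir)
    # Check everything first so a missing file does not leave a half-updated dataset.
    for name in SPLIT_FILES:
        if not (src / name).is_file():
            raise FileNotFoundError(f"missing split file: {src / name}")
    for name in SPLIT_FILES:
        target = dst / name
        backup = dst / (name + backup_suffix)
        # Keep the very first original; later runs do not overwrite the backup.
        if target.exists() and not backup.exists():
            shutil.copy2(target, backup)
        shutil.copy2(src / name, target)
```

Keeping a backup makes it easy to return to the default splits when comparing strategies.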

Other

License

AGPL v3.0

Some Performance Metrics

Here are some performance metrics I recorded on the following hardware after the release of Common Voice v16.1, where I re-implemented the multiprocessing and ran the whole set of algorithms (except s1, which I've taken from the releases) on all active CV releases (leaving out intermediate/corrected ones like v2, 5.0, 16.0 etc).

Algo    Total DS    Processed DS    Total Dur    Avg Sec per DS
s99     1,222       1,207           05:40:02     16.904
v1      1,222       1,222           00:05:03     0.251
vw      1,222       633             00:03:32     0.336
vx      1,222       617             00:03:23     0.330

DS: Dataset. All algorithms ran on 12 parallel processes

TO-DO/Project plan, issues and feature requests

You can look at the results in the Common Voice Dataset Analyzer. This will eventually be part of the Common Voice Toolbox's Core, but it will be developed here...

The project status can be found on the project page. Please post issues and feature requests, or open pull requests with enhancements.



FUTURE REFERENCE

Although this will go into the core, we will also publish it separately. Below you'll find what it will become (tentative, might change):

tbox_split_maker

The script can use alternative splitting strategies for you to try on your language/languages. Then you should re-run "tbox_diversity_table" to analyze statistics in these new splits and compare with other strategies.

python3 tbox_split_maker <--split_strategy|--ss <strategy_code> [<parameter>] > [--exp <experiments_directory>] --in <path|experiment> --out <path|experiment> [--verbose]

Options:

--exp : If given, experiments are searched/created under this directory. If not given, experiments directory under cloned repo will be used.

--in <path|experiment> : If an existing full path is given, that directory is used to feed the default splits (e.g. --in c:\datasets\cv\v10.0\en). If only a name is given (e.g. --in releases), it is assumed to be under the experiments_directory and is searched for there.

--out <path|experiment> : If an existing full path is given (e.g. --out d:\trials\splits), it will be used for output of new splits. If only a name is given (e.g. --out releases), it is assumed to be under the experiments_directory and is created there. In either case, the source is first copied to the destination and THEN the new splits override the existing ones. Other unrelated .tsv files are also copied for dataset completeness. If your splits seem OK, you can just copy these files over the original dataset you downloaded and expanded.

--verbose: Prints more information; by default, minimal information is displayed.
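The path-or-experiment rule for --in and --out could be resolved along these lines. This is a sketch of the described behaviour only, since the script is still tentative; `resolve_location` is a hypothetical name, and raising an error for an absolute path that does not exist is an assumption the text does not specify.

```python
from pathlib import Path


def resolve_location(value, experiments_dir):
    """Resolve an --in/--out value to a directory path (hypothetical helper).

    An existing full path is used as-is; a bare name is placed under the
    experiments directory.
    """
    path = Path(value).expanduser()
    if path.is_absolute():
        if path.is_dir():
            return path
        # Assumption: a full path that does not exist is treated as an error.
        raise FileNotFoundError(f"directory not found: {path}")
    # A bare name such as "releases" lives under the experiments directory.
    return Path(experiments_dir).expanduser() / value
```

For example, `resolve_location("releases", "~/cv/experiments")` points at the `releases` directory inside the expanded experiments directory, whether or not it exists yet.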

Currently supported strategies:

--split_strategy cc [N]

Run Common Voice's Corpora Creator with an alternative recordings-per-sentence setting. By default, this is 1, meaning there is 1 recording per sentence in the final splits, even if different users might record sentences multiple times. Although this setting is meaningful to prevent sentence bias, it might be desirable to have sentences recorded by different voices/genders/ages/accents so that your model gets better on alternatives. Also, especially with low-resource languages, the default setting drops the training split size to a small fraction of what's available.

For this to work, you need to clone and install the Mozilla Common Voice CorporaCreator repo as follows:

git clone https://github.com/common-voice/CorporaCreator.git
cd CorporaCreator; python3 setup.py install
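To illustrate what the recordings-per-sentence setting N means, here is a minimal sketch that keeps at most N recordings of each sentence. It is not CorporaCreator's actual implementation; the function name is hypothetical, and rows are assumed to be dictionaries with a `sentence` key, as in the Common Voice .tsv metadata.

```python
from collections import Counter


def limit_recordings_per_sentence(rows, max_per_sentence=1):
    """Keep at most `max_per_sentence` recordings for every sentence.

    Illustrative only: the first recordings encountered are the ones kept.
    """
    kept = []
    counts = Counter()
    for row in rows:
        if counts[row["sentence"]] < max_per_sentence:
            kept.append(row)
            counts[row["sentence"]] += 1
    return kept
```

With the default of 1, a sentence recorded by five people contributes a single recording; raising the limit keeps more of the available data at the cost of repeated sentences.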

--split_strategy sentence

In this strategy sentence unbiasing has precedence, so that no sentence is repeated in other splits, but voices (people) can exist in other splits. This strategy ensures that the whole validated set is used. You might like to experiment with the percentages though. Usually 80-10-10 or 70-15-15 are considered good values (make sure they add up to 100).

Ex:

python tbox_split_maker --exp ~/cv/experiments --in default --out test-70-15-15 --ss sentence 70 15 15
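The idea behind the sentence strategy can be sketched as follows. This is an illustration under assumptions rather than the script's algorithm: `split_by_sentence` is a hypothetical name, rows are dictionaries with a `sentence` key, and the percentages are applied to recording counts, so the resulting sizes are only approximate when sentences have several recordings.

```python
import random
from collections import defaultdict


def split_by_sentence(rows, percentages=(70, 15, 15), seed=42):
    """Split recordings so that no sentence appears in more than one split."""
    if sum(percentages) != 100:
        raise ValueError("percentages must add up to 100")
    # Group all recordings of the same sentence so they move together.
    groups = defaultdict(list)
    for row in rows:
        groups[row["sentence"]].append(row)
    sentences = sorted(groups)
    random.Random(seed).shuffle(sentences)

    total = len(rows)
    train_end = total * percentages[0] / 100
    dev_end = total * (percentages[0] + percentages[1]) / 100
    splits = {"train": [], "dev": [], "test": []}
    assigned = 0
    for sentence in sentences:
        # Fill train first, then dev; whatever remains goes to test.
        if assigned < train_end:
            name = "train"
        elif assigned < dev_end:
            name = "dev"
        else:
            name = "test"
        splits[name].extend(groups[sentence])
        assigned += len(groups[sentence])
    return splits
```

Every recording lands in exactly one split, which matches the property that the whole validated set is used.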

--split_strategy sentence-voice

This algorithm is similar to the "sentence" strategy, but it also ensures that no voice appears in more than one split. Therefore the dataset will not be fully used. The test and dev splits will be as requested, but the train split will be smaller, so the percentages will not add up to 100. This strategy prevents both sentence and voice bias and usually uses most of the validated set. The unused amount depends entirely on the dataset: the size of its text corpus, the number of voice contributors, how many repeated recordings it has, and whether people record too few and/or too many sentences. Therefore it is good practice to analyze the generated split with the tbox_diversity_table script...

Ex:

python tbox_split_maker --exp ~/cv/experiments --in default --out test-70-15-15 --ss sentence-voice 70 15 15 # note that these numbers are target, result will be different
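A quick way to verify the two guarantees of this strategy is to look for values shared between splits. The checker below is a sketch, not a replacement for the analysis script; `find_overlaps` is a hypothetical name, and the `sentence` and `client_id` keys are assumed to match the Common Voice .tsv columns.

```python
def find_overlaps(splits, key):
    """Return the values of `key` that occur in more than one split.

    `splits` maps a split name to a list of row dictionaries.
    """
    first_seen = {}
    overlaps = set()
    for split_name, rows in splits.items():
        # Use a set so repeats inside one split are not counted as overlap.
        for value in {row[key] for row in rows}:
            if value in first_seen and first_seen[value] != split_name:
                overlaps.add(value)
            first_seen.setdefault(value, split_name)
    return overlaps
```

For a sentence-voice split, both `find_overlaps(splits, "sentence")` and `find_overlaps(splits, "client_id")` should come back empty; for the plain sentence strategy only the first one should.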

--split_strategy random

This is a dummy algorithm that does not take any bias into account. It splits the whole dataset randomly and fully. The resultant model performance in terms of bias might also be random. Here, the split percentages also don't need to add up to 100. This is provided for experimenting, by slicing different smaller sizes.

Ex:

python tbox_split_maker --exp ~/cv/experiments --in default --out random-50-10-10 --ss random 50 10 10
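A random split with percentages that need not total 100 can be sketched like this. The function name `split_randomly` and the fixed seed are assumptions made for illustration; the actual script may slice differently.

```python
import random


def split_randomly(rows, percentages=(50, 10, 10), seed=42):
    """Shuffle the rows and slice train/dev/test by percentage.

    The percentages may total less than 100; the remainder is left unused.
    """
    if sum(percentages) > 100:
        raise ValueError("percentages must not exceed 100 in total")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    total = len(shuffled)
    splits = {}
    start = 0
    for name, pct in zip(("train", "dev", "test"), percentages):
        size = total * pct // 100
        # Consecutive, non-overlapping slices of the shuffled list.
        splits[name] = shuffled[start:start + size]
        start += size
    return splits
```

With 50 10 10, 30 percent of the recordings stay out of every split, which is how smaller slices of a dataset can be tried.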