common-voice / common-voice-bundler

CommonVoice Bundler

Script for bundling Common Voice (https://voice.mozilla.org) clips by language.

What it does

  1. Query database for all clip data
  2. Download all of those clips from an S3 bucket, separated into per-language directories
  3. Write the clip metadata to a clips.tsv file with anonymized client_id values
  4. Analyze the clip metadata and assemble aggregate stats for stats.json
  5. Calculate the total duration of each dataset
  6. Prompt you to run corpora-creator, which will take the clips.tsv file and analyze it to create test/dev/train sets for machine learning purposes
  7. Create .tar.gz bundles according to your settings, usually one per language
  8. Create a checksum for each tarball
  9. Upload each tarball to S3
  10. Write each checksum to the stats.json file and upload that file to S3 as well
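
As a rough illustration of steps 3 and 8 above, the following sketch shows one way client_id values could be anonymized and a tarball checksum computed in Node. The function names `anonymizeClientId` and `checksumFile`, and the choice of SHA-512 and SHA-256, are assumptions made for this example and may not match what the bundler actually does.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Step 3 (assumed approach): replace a raw client_id with a one-way hash so
// clips.tsv can still group clips by speaker without exposing the original ID.
// SHA-512 is an assumption, not necessarily the bundler's algorithm.
function anonymizeClientId(clientId) {
  return crypto.createHash('sha512').update(String(clientId)).digest('hex');
}

// Step 8 (assumed approach): hash a tarball in 1 MiB chunks so that a
// multi-gigabyte file is never loaded into memory all at once.
// SHA-256 is likewise an assumption.
function checksumFile(filePath, algorithm = 'sha256') {
  const hash = crypto.createHash(algorithm);
  const buffer = Buffer.alloc(1024 * 1024);
  const fd = fs.openSync(filePath, 'r');
  try {
    let bytesRead;
    // A null position makes readSync continue from the current file offset.
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest('hex');
}
```

The same input always hashes to the same output, so a given speaker keeps a stable anonymized ID across clips, and a downloaded tarball can be verified by recomputing its checksum and comparing it with the value recorded in stats.json.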

How to run it

  1. Install node (>= 8.3.0)
  2. Install yarn
  3. Install CorporaCreator
  4. Install mp3-duration-sum
  5. git clone git@github.com:Common-Voice/common-voice-bundler.git
  6. Override the keys defined in config.js with a config.json in the same directory
  7. yarn
  8. yarn start
  9. You will be prompted to run corpora-creator separately. Follow the instructions.
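
Step 6 above depends on config.json taking precedence over the defaults in config.js. A minimal sketch of how such an override could work is shown below. The `loadConfig` helper and the keys `exampleBucket` and `exampleSkipDownload` are hypothetical and exist only for illustration; the real keys are the ones defined in the bundler's config.js.

```javascript
const fs = require('fs');

// Hypothetical defaults for illustration only; the real keys live in config.js.
const exampleDefaults = {
  exampleBucket: 'default-bucket',
  exampleSkipDownload: false,
};

// Read config.json if it exists and let its keys win over the defaults.
// Keys missing from config.json keep their default values.
function loadConfig(defaults, overridePath) {
  let overrides = {};
  if (fs.existsSync(overridePath)) {
    overrides = JSON.parse(fs.readFileSync(overridePath, 'utf8'));
  }
  return { ...defaults, ...overrides };
}
```

With this kind of shallow merge, config.json only needs to list the keys that differ from the defaults.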

Configuration

In order to run this, you need to override the default keys defined in config.js with a config.json in the same directory. At an absolute minimum, you will need:

The other options are:

Resume from interruptions

You should run this script from a tmux session in the EC2 shell you're provided with, so that the script keeps running if your connection dies. Sometimes the script itself will die, in which case it will attempt to recover gracefully in the following ways:

In addition, you can use the options specified above to resume from key points in the process instead of running through the entire process from scratch.
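
To make the idea of resuming concrete, here is a hypothetical sketch of one common recovery strategy: skipping clips that are already on disk when the download step restarts. This is not taken from the bundler's source. The helpers `needsDownload` and `pendingClips` are invented for this example, and the rule that a zero-byte file counts as an interrupted download is an assumption.

```javascript
const fs = require('fs');
const path = require('path');

// A clip needs downloading if it is missing or empty. Treating a zero-byte
// file as an interrupted download is an assumption of this sketch.
function needsDownload(targetPath) {
  try {
    return fs.statSync(targetPath).size === 0;
  } catch (err) {
    if (err.code === 'ENOENT') return true; // file was never written
    throw err; // surface unexpected errors such as permission problems
  }
}

// Given clip paths relative to the output directory (for example
// "en/clip_1.mp3"), return only those that still have to be fetched.
function pendingClips(clipPaths, outputDir) {
  return clipPaths.filter((clip) => needsDownload(path.join(outputDir, clip)));
}
```

A size check like this cannot detect a clip that was only partly written, so a more careful version might compare the local size against the object size reported by S3.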

Troubleshooting