Script for bundling Common Voice (https://voice.mozilla.org) clips by language.
clips.tsv file with anonymized client_id values
stats.json
corpora-creator, which will take the clips.tsv file and analyze it to create test/dev/train sets for machine learning purposes
.tar.gz bundles according to your settings, usually one per language
stats.json file and also upload that to S3

git clone git@github.com:Common-Voice/common-voice-bundler.git
config.json in the same dir
yarn
yarn start
corpora-creator separately. Follow the instructions.

In order to run this, you need to override the default keys defined in config.js with a config.json in the same directory. At an absolute minimum, you will need:
releaseName: the name of the release. This can take the form of an AWS key, and any / in the name will be treated as a directory separator
queryFile: the name of the file that specifies the SQL query for a given dataset (see the /queries directory for past files)
db object
clipBucket object
outBucket object (which refers to the bucket that the bundled dataset will be hosted on)

The other options are:
cutoffTime: clips will only be downloaded if they were created before this time
startCutoffTime: clips will only be included if they were created after this time. Use this for delta releases, in conjunction with cutoffTime
skipBundling: this will do everything except bundle and upload clips (used mostly for testing)
skipCorpora: this will do everything but skip waiting for you to create the corpora (used if the process was interrupted and you already have the appropriate corpora)
skipHashing: this will skip hashing the client ID (used mostly for testing)
skipDownload: this will skip downloading the file and just create clips.tsv (used mostly for testing)
skipMinorityCheck: this will skip checking which languages have fewer than 5 speakers
skipReportedSentences: this will not include the list of reported sentences in each dataset (used for the singleword target segment bundle)
startFromCorpora: this will begin the whole process at the prompt for the corpora (used if the process was interrupted and you already have all the files and clips metadata)
singleBundle: this will create a single archive with all languages, instead of one tar per language

You should run this script from a tmux session in the EC2 shell you're provided with, so that the script can continue to run if your connection dies. Sometimes the script itself will die, in which case it will attempt to recover gracefully in the following ways:
stats.json as much as possible, so that you have in-progress stats even if the whole process doesn't finish

In addition, you can use the options specified above to resume from key points in the process instead of running through the entire process from scratch.
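
As a rough illustration of how the options fit together, the sketch below merges a set of overrides on top of defaults, checks for the five required keys, and applies the cutoffTime/startCutoffTime window to a clip timestamp. It is not the bundler's actual code: the function names, the shallow merge, the exclusive time bounds, and every example value (release name, query file name, empty objects) are assumptions made for the example.

```javascript
// Keys listed above as the absolute minimum for config.json.
const REQUIRED_KEYS = ["releaseName", "queryFile", "db", "clipBucket", "outBucket"];

// Shallow merge: any key present in the overrides replaces the default.
// ASSUMPTION: the real script may merge differently (for example, deeply).
function mergeConfig(defaults, overrides) {
  return { ...defaults, ...overrides };
}

// Return the required keys that are absent from the override object.
function missingRequiredKeys(overrides) {
  return REQUIRED_KEYS.filter((key) => !(key in overrides));
}

// A clip is in the release window when it was created after startCutoffTime
// and before cutoffTime. ASSUMPTION: both bounds are exclusive and optional.
function inReleaseWindow(createdAt, { startCutoffTime, cutoffTime }) {
  const t = new Date(createdAt).getTime();
  if (startCutoffTime && t <= new Date(startCutoffTime).getTime()) return false;
  if (cutoffTime && t >= new Date(cutoffTime).getTime()) return false;
  return true;
}

// Hypothetical defaults and overrides, for illustration only.
const defaults = { skipBundling: false, skipCorpora: false, startFromCorpora: false };
const overrides = {
  releaseName: "example-release/2020-01-01", // "/" is treated as a directory
  queryFile: "example-query.sql",            // hypothetical file name
  db: {},                                    // connection settings elided
  clipBucket: {},                            // bucket settings elided
  outBucket: {},                             // bucket settings elided
  startCutoffTime: "2020-01-01T00:00:00Z",
  cutoffTime: "2020-07-01T00:00:00Z",
  startFromCorpora: true, // resume at the corpora prompt after an interruption
};

const config = mergeConfig(defaults, overrides);
console.log(missingRequiredKeys(overrides));                  // []
console.log(config.startFromCorpora);                         // true
console.log(inReleaseWindow("2020-03-15T12:00:00Z", config)); // true
console.log(inReleaseWindow("2019-12-31T23:59:59Z", config)); // false
```

Setting startFromCorpora to true in the overrides, as in the example, is how a resumed run would differ from a fresh one; everything else in the configuration can stay the same.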
stats.json has durations of 0 for some/all languages: mp3-duration-sum runs in the background after all the clips have been downloaded, and there is no signal when it completes other than the stats file receiving updated durations. If you skip corpora creation or if most of your tar files have already been created, the script may terminate before mp3-duration-sum has completed and updated the stats file. The workaround is to artificially pause the script by setting skipCorpora to false, and simply not moving on to the next stage until you've verified that the durations have been updated.

CorporaCreator terminates or runs out of memory: The Corpora Creator is itself somewhat fragile, as it hasn't been substantially updated since it was created, and may need tweaking to run. You can narrow down the bug by taking the first 10,000 rows of clips.tsv using head and then running CorporaCreator on the smaller file, to identify whether the problem is the file size or your install. If the problem is the file size, you may need to upgrade to a larger EC2 instance. Contact IT-SE.
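
Both checks above can be sketched as small helpers. The first lists the languages whose duration is still 0; it assumes stats.json has the shape { locales: { <locale>: { duration: <number> } } }, which may not match the real file. The second keeps the first lines of a text file, as a JavaScript stand-in for head; for the full clips.tsv, head itself is the better tool because it does not load the whole file into memory. The function names and the sample data are hypothetical.

```javascript
// Return the locales whose recorded duration is still 0 (or missing).
// ASSUMPTION: stats looks like { locales: { <locale>: { duration: <number> } } }.
function localesWithZeroDuration(stats) {
  return Object.entries(stats.locales || {})
    .filter(([, info]) => !info.duration)
    .map(([locale]) => locale);
}

// Keep the first `count` lines of a text, similar to `head -n count`
// (the result has no trailing newline).
function firstLines(text, count) {
  return text.split("\n").slice(0, count).join("\n");
}

// Hypothetical sample data, for illustration only.
const exampleStats = { locales: { en: { duration: 1234 }, cy: { duration: 0 } } };
console.log(localesWithZeroDuration(exampleStats)); // [ 'cy' ]

const exampleTsv = "client_id\tpath\nabc\t1.mp3\ndef\t2.mp3\n";
console.log(firstLines(exampleTsv, 2)); // the header line plus the first data row
```

An empty result from localesWithZeroDuration suggests the durations have been filled in and it is safe to move on to the next stage.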