Open caseybasichis opened 4 years ago
Thanks for your interest! For training data preparation, you can follow their tutorial to get familiar with the input data structure. The CommonVoice data are a good example, and each training sample should be no more than a few seconds long.
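To make the expected input concrete: DeepSpeech training manifests are CSV files with `wav_filename`, `wav_filesize`, and `transcript` columns. Here is a minimal sketch that writes such a manifest while skipping clips longer than a cutoff (the 10-second threshold is my own assumption for "no more than a few seconds"):

```python
import csv
import wave

MAX_SECONDS = 10  # assumption: cutoff for "no more than a few seconds"

def clip_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def write_manifest(clips, out_csv):
    """Write a DeepSpeech-style training CSV.

    clips: iterable of (wav_path, wav_size_bytes, transcript) tuples.
    Clips longer than MAX_SECONDS are skipped.
    """
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for path, size, text in clips:
            if clip_duration(path) <= MAX_SECONDS:
                writer.writerow([path, size, text])
```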
The only difference in the workflow involving wav2vec pre-training is that we use the `.h5context` files produced by wav2vec instead of the `.wav` files. (The `.h5context` files are the ones referred to as "embeddings" in the wav2vec instructions.)
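If you want to sanity-check those `.h5context` files before training, they can be opened with `h5py`. The sketch below just lists the datasets and their shapes; the internal layout (one dataset of frame-level embedding vectors per file, and the dataset name) is an assumption, so check it against your own output:

```python
import h5py

def inspect_h5context(path):
    """Return {dataset_name: shape} for every dataset in an HDF5 file.

    Assumption: a wav2vec .h5context file holds one or more datasets of
    frame-level embedding vectors (e.g. shape (num_frames, embed_dim)).
    """
    shapes = {}
    with h5py.File(path, "r") as f:
        f.visititems(
            lambda name, obj: shapes.update({name: obj.shape})
            if isinstance(obj, h5py.Dataset) else None
        )
    return shapes
```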
For the GPU, I trained my data on a single P100. Pre-training with wav2vec is not resource-demanding (136 hrs of speech took only a few hours to converge), and my training data for DeepSpeech is rather small (0.5 hr of speech). I'm no expert on GPUs, so you might want to look for help in DeepSpeech's docs or on Discourse.
Hi C,
Thank you for that summary. I'm very excited to hear this is underway.
I'm new to working with DeepSpeech so I have a bit of catch-up to do before I can get this going.
Are there any wav data particulars that need to be followed?
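For what it's worth while waiting for an answer: the DeepSpeech docs describe the expected audio as 16 kHz, 16-bit, mono PCM WAV. A quick stdlib check along those lines (worth confirming the exact requirements against the docs for your DeepSpeech version):

```python
import wave

def check_wav(path, expected_rate=16000):
    """Report mismatches against mono, 16-bit, 16 kHz PCM WAV.

    Returns a list of problem strings; an empty list means the clip
    looks fine. The expected format is taken from DeepSpeech's docs.
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"expected mono, got {w.getnchannels()} channels")
        if w.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * w.getsampwidth()}-bit")
        if w.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {w.getframerate()} Hz")
    return problems
```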
I currently have a 1080 Ti. I'm planning on getting a 3090 when they are available, or possibly an RTX 8000. Are any of those adequate for training this?
Should I start a Discourse thread?