Open caseybasichis opened 4 years ago
Thanks for your interest! For training data preparation, you can follow their tutorial to get familiar with the input data structure. The CommonVoice data are a good example, and each training sample should be no more than a few seconds long.
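To make the expected input concrete: DeepSpeech training manifests are CSV files with `wav_filename`, `wav_filesize`, and `transcript` columns. Here is a minimal sketch that writes such a manifest while skipping clips longer than a cutoff (the 10-second threshold is my own assumption for "no more than a few seconds"):

```python
import csv
import wave

MAX_SECONDS = 10  # assumption: cutoff for "no more than a few seconds"

def clip_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def write_manifest(clips, out_csv):
    """Write a DeepSpeech-style training CSV.

    clips: iterable of (wav_path, wav_size_bytes, transcript) tuples.
    Clips longer than MAX_SECONDS are skipped.
    """
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for path, size, text in clips:
            if clip_duration(path) <= MAX_SECONDS:
                writer.writerow([path, size, text])
```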
The only difference in the workflow involving wav2vec pre-training is that we use the `.h5context` files produced by wav2vec instead of the `.wav` files. (The `.h5context` files are the ones referred to as "embeddings" in the wav2vec instructions.)
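If you want to sanity-check those `.h5context` files before training, they can be opened with `h5py`. The sketch below just lists the datasets and their shapes; the internal layout (one dataset of frame-level embedding vectors per file, and the dataset name) is an assumption, so check it against your own output:

```python
import h5py

def inspect_h5context(path):
    """Return {dataset_name: shape} for every dataset in an HDF5 file.

    Assumption: a wav2vec .h5context file holds one or more datasets of
    frame-level embedding vectors (e.g. shape (num_frames, embed_dim)).
    """
    shapes = {}
    with h5py.File(path, "r") as f:
        f.visititems(
            lambda name, obj: shapes.update({name: obj.shape})
            if isinstance(obj, h5py.Dataset) else None
        )
    return shapes
```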
For the GPU, I trained my data on a single P100. Pre-training with wav2vec is not resource-demanding (136 hrs of speech took only a few hours to converge), and my training data for DeepSpeech is rather small (0.5 hr of speech). I'm no expert on GPUs, so you might want to look for help in DeepSpeech's docs or on Discourse.
Hi C,
Thank you for that summary. I'm very excited to hear this is underway.
I'm new to working with DeepSpeech so I have a bit of catch-up to do before I can get this going.
Are there any wav data particulars that need to be followed?
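For what it's worth while waiting for an answer: the DeepSpeech docs describe the expected audio as 16 kHz, 16-bit, mono PCM WAV. A quick stdlib check along those lines (worth confirming the exact requirements against the docs for your DeepSpeech version):

```python
import wave

def check_wav(path, expected_rate=16000):
    """Report mismatches against mono, 16-bit, 16 kHz PCM WAV.

    Returns a list of problem strings; an empty list means the clip
    looks fine. The expected format is taken from DeepSpeech's docs.
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"expected mono, got {w.getnchannels()} channels")
        if w.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * w.getsampwidth()}-bit")
        if w.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {w.getframerate()} Hz")
    return problems
```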
I currently have a 1080 Ti. I'm planning on getting a 3090 when they are available, or possibly an RTX 8000. Are any of those adequate for training this?
Should I start a Discourse thread?