I've already checked that; I need to pretrain on my unsupervised dataset to get further gains on downstream tasks.
The Wav2Vec-BERT 2.0 encoder was trained on 4.5 million hours of unlabeled audio covering 143 languages, and the newer version was trained on even more low-resource languages. See Section 3.2.1 of the Seamless paper for details on the pre-training of this model. The point is that continuing to train the encoder on your own data would just make it forget its vast vocabulary of speech patterns, so it's not recommended.
But even then, check the Wav2Vec-BERT 2.0 model card for more information on fine-tuning it for your custom set of languages.
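As a quick pointer (not from the model card verbatim, just a minimal sketch assuming the public `facebook/w2v-bert-2.0` checkpoint on the Hugging Face Hub), this is roughly how you would load the encoder and its feature extractor before setting up fine-tuning; the dummy audio is only there to illustrate the input/output shapes:

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

# Public Wav2Vec-BERT 2.0 encoder checkpoint on the Hugging Face Hub.
checkpoint = "facebook/w2v-bert-2.0"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2BertModel.from_pretrained(checkpoint)

# 16 kHz mono audio as a float array; a 1-second silent clip just for illustration.
dummy_audio = torch.zeros(16000).numpy()
inputs = feature_extractor(dummy_audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)
```

For supervised fine-tuning on labelled data, the usual route from there is a task head such as `Wav2Vec2BertForCTC` plus a tokenizer for your target languages, as described in the model card.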
I have a number of private unlabelled speech corpora for Indian language families, so it's an obvious choice for me to continually pre-train the w2v-BERT 2.0 model on this extended dataset to try to get maximum performance on downstream tasks.
Hence, if possible, may I know whether there is a pretraining script available for w2v-BERT 2.0 that I could use to produce a further-pretrained checkpoint for improved performance?
Additionally, could you also suggest suitable hyperparameters for continual pretraining of the model?