facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

scripts to reproduce w2v-BERT 2.0 pretraining? #342

Open StephennFernandes opened 9 months ago

StephennFernandes commented 9 months ago

I have a set of private unlabelled speech corpora covering Indian language families, so continually pre-training the w2v-BERT 2.0 model on this extended dataset is the obvious choice to get the best possible performance on downstream tasks.

Is there a pretraining script available for w2v-BERT 2.0 that I could use to further pre-train the released checkpoint for improved performance?

Additionally, could you suggest suitable hyperparameters for continual pretraining of the model?

StephennFernandes commented 9 months ago

I've already checked that. I need to pretrain on my unsupervised dataset to get better performance on downstream tasks.

Awaisn25 commented 9 months ago

The Wav2Vec-BERT 2.0 encoder was trained on 4.5 million hours of unlabeled audio covering 143 languages, and the newer version was trained on additional low-resource languages. See Section 3.2.1 of the Seamless paper for details on how this model was pre-trained. The point is that continuing to train the encoder on your own data alone risks making it forget the broad range of speech patterns it has already learned, so it is not recommended.

Even so, check the Wav2Vec-BERT 2.0 model card for more information on fine-tuning it for your custom set of languages.
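For reference, here is a minimal sketch of how the public w2v-BERT 2.0 checkpoint can be loaded for CTC fine-tuning with Hugging Face transformers (>= 4.37). This is not an official script from this repository; the `facebook/w2v-bert-2.0` checkpoint ID and the transformers classes are real, while `vocab.json`, the sample transcript, and the dummy waveform are placeholders for illustration.

```python
# Hedged sketch: CTC fine-tuning of the public w2v-BERT 2.0 checkpoint.
# "vocab.json" and the example data below are hypothetical placeholders.
import torch
from transformers import (
    SeamlessM4TFeatureExtractor,
    Wav2Vec2BertForCTC,
    Wav2Vec2BertProcessor,
    Wav2Vec2CTCTokenizer,
)

# Character-level tokenizer built from your own transcripts.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pretrained encoder with a fresh CTC head sized to the custom vocabulary.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)

# One illustrative training step on a single (audio, transcript) pair at 16 kHz.
audio = torch.randn(16000 * 3).numpy()  # stand-in for a real 3-second waveform
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text="namaste duniya", return_tensors="pt").input_ids

model.train()
loss = model(input_features=inputs.input_features, labels=labels).loss
loss.backward()
```

For a real run you would wrap this in a data collator and Trainer loop as described in the model card; the loading pattern above is the part relevant to adapting the checkpoint to new languages without touching the pretraining objective.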