Open siyan-sylvia-li opened 2 years ago

Hello!

We are interested in using the HuBERT model trained / fine-tuned on the Fisher corpus, as well as the HiFi-GAN vocoder that generates audio directly from the units, for academic research. Is it possible that these models will be released soon? Thank you very much!
Hi @siyan-sylvia-li, As for the vocoder trained with discrete units, we do not plan to release this model soon, so please see this repo: https://github.com/facebookresearch/speech-resynthesis and train it yourself. Regarding HuBERT, I recommend using the fairseq implementation here: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm
Thank you so much! I noticed that the speech-resynthesis repo has no support for wav2vec 2.0, but the GSLM unit2speech module does support wav2vec 2.0. Are the speech-resynthesis code and the GSLM unit2speech code fundamentally different? Thanks again!
@siyan-sylvia-li, Yes, they are quite different. The GSLM unit2speech model is based on Tacotron 2, while speech-resynthesis is based on HiFi-GAN. If you want to use wav2vec 2.0, you can extract discrete codes from wav2vec 2.0 and use them to train a unit2speech model from the speech-resynthesis repo.
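If it helps, here is a minimal Python sketch of that unit-extraction step. It assumes a k-means codebook already trained on wav2vec 2.0 features (e.g., with the GSLM speech2unit clustering scripts); the file name `km.bin`, the choice of layer, and the use of torchaudio's published wav2vec 2.0 weights are illustrative assumptions, not artifacts shipped with either repo:

```python
# Sketch: map a waveform to discrete units via wav2vec 2.0 features + k-means.
import joblib                    # loads a scikit-learn k-means codebook
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE       # pretrained wav2vec 2.0
model = bundle.get_model().eval()

wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one [1, frames, dim] tensor per transformer layer
    feats, _ = model.extract_features(wav)
    layer_feats = feats[6].squeeze(0).cpu().numpy()   # intermediate layer; assumption

kmeans = joblib.load("km.bin")            # pretrained codebook; assumption
units = kmeans.predict(layer_feats)       # one discrete code per frame
print(units[:20])
```

The resulting unit sequences would then stand in for the HuBERT/CPC tokens when training the speech-resynthesis vocoder.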
Hello, I have two questions:

1. What hardware and training schedule did you use for the vocoder (e.g., how many GPUs and how many iterations), and would fewer GPUs still work?
2. How would I adapt the training pipeline to use wav2vec 2.0 units instead of HuBERT/CPC units?

Thank you very much for your time!
Hi @siyan-sylvia-li,

1. We train our model on 8 GPUs for 400K iterations; you can see the details in the code: https://github.com/facebookresearch/speech-resynthesis. Training on fewer GPUs should also work, but it will probably converge more slowly.
2. You need to replace the tokens extracted from HuBERT/CPC with tokens extracted from wav2vec 2.0. First, extract the units for the VCTK corpus using https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit. Then train your vocoder on these units using this repo: https://github.com/facebookresearch/speech-resynthesis
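To connect the two steps, here is a rough sketch of turning the speech2unit output into training metadata for the vocoder. It assumes the quantizer writes one line per utterance in the form `<utt_id>|<space-separated units>`; the JSON-lines layout with `audio` and `units` keys below is only an illustration, so match it to whatever format the speech-resynthesis dataset code actually reads:

```python
# Sketch: build a vocoder training manifest from GSLM speech2unit output.
# Assumes quantized lines look like "p225_001|12 12 53 ..." and that the
# vocoder expects JSON lines with an audio path and a unit string; adjust
# both to the actual formats used by the two repos.
import json
from pathlib import Path

WAV_DIR = Path("VCTK-Corpus/wav16")        # hypothetical 16 kHz VCTK directory

with open("vctk_quantized.txt") as f_in, open("train_manifest.txt", "w") as f_out:
    for line in f_in:
        utt_id, units = line.rstrip("\n").split("|", 1)
        record = {
            "audio": str(WAV_DIR / f"{utt_id}.wav"),
            "units": units,                # space-separated discrete codes
        }
        f_out.write(json.dumps(record) + "\n")
```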