facebookresearch / textlesslib

Library for Textless Spoken Language Processing
MIT License

Model Release for "Generative Spoken Dialogue Language Modeling"? #18


siyan-sylvia-li commented 2 years ago

Hello!

We are interested in using the HuBERT model trained / fine-tuned on the Fisher corpus, as well as the HiFi-GAN vocoder that generates audio directly from the units, for academic research. Is it possible for these models to be released soon? Thank you very much!

adiyoss commented 2 years ago

Hi @siyan-sylvia-li, as for the vocoder trained with discrete units, we do not plan to release this model soon, so please see this repo: https://github.com/facebookresearch/speech-resynthesis and train it yourself. Regarding HuBERT, I recommend using the fairseq implementation here: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm
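
For concreteness, here is a minimal sketch of what "extract discrete units with HuBERT" looks like, loosely following the GSLM speech2unit recipe: frame-level features from a pretrained HuBERT checkpoint are mapped to cluster ids with a k-means quantizer. The checkpoint path, k-means file, and layer index below are placeholders, not files shipped with this repo; see the fairseq GSLM example for the released checkpoints and exact recipe.

```python
# Sketch: dump discrete units from HuBERT features with a k-means quantizer.
# Paths and the layer index are placeholders; consult the fairseq GSLM
# speech2unit example for the released checkpoints and recommended settings.
import torch
import joblib
import soundfile as sf
from fairseq import checkpoint_utils

HUBERT_CKPT = "hubert_base_ls960.pt"   # placeholder: pretrained HuBERT checkpoint
KMEANS_PATH = "km100.bin"              # placeholder: k-means model (e.g. 100 clusters)
LAYER = 6                              # placeholder: transformer layer to take features from

models, _, _ = checkpoint_utils.load_model_ensemble_and_task([HUBERT_CKPT])
hubert = models[0].eval()
kmeans = joblib.load(KMEANS_PATH)

wav, sr = sf.read("example.wav")       # HuBERT expects 16 kHz mono audio
assert sr == 16000
source = torch.tensor(wav, dtype=torch.float32).unsqueeze(0)

with torch.no_grad():
    # extract_features returns frame-level features from the requested layer
    feats, _ = hubert.extract_features(source, padding_mask=None, output_layer=LAYER)

units = kmeans.predict(feats.squeeze(0).numpy())   # one cluster id per ~20 ms frame
print(" ".join(map(str, units)))
```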

siyan-sylvia-li commented 2 years ago

Thank you so much! I noticed that the speech-resynthesis repo has no support for wav2vec 2.0, but the gslm unit2speech module does support it. Are the speech-resynthesis code and the gslm unit2speech code fundamentally different? Thanks again!

adiyoss commented 2 years ago

@siyan-sylvia-li, yes, they are quite different. In GSLM it is based on Tacotron 2, and in speech-resynthesis it is based on HiFi-GAN. In case you want to use wav2vec 2.0, you can extract discrete codes from wav2vec 2.0 and use them to train a unit2speech model from the speech-resynthesis repo.
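
As a rough illustration of that suggestion, the sketch below learns a k-means codebook over wav2vec 2.0 features and then maps audio to unit ids. The checkpoint path, codebook size, and file list are illustrative assumptions, and the behavior of `extract_features` shown here matches recent fairseq versions (it returns a dict with the contextual features under `"x"`); adapt as needed.

```python
# Sketch: learn a k-means codebook over wav2vec 2.0 features, then map audio
# to discrete unit ids. Checkpoint path, codebook size and file list are
# illustrative assumptions, not part of the released recipes.
import numpy as np
import torch
import soundfile as sf
from sklearn.cluster import MiniBatchKMeans
from fairseq import checkpoint_utils

W2V2_CKPT = "wav2vec_small.pt"   # placeholder: pretrained wav2vec 2.0 checkpoint
N_CLUSTERS = 100                 # placeholder: codebook size

models, _, _ = checkpoint_utils.load_model_ensemble_and_task([W2V2_CKPT])
w2v2 = models[0].eval()

def features(path):
    wav, sr = sf.read(path)
    assert sr == 16000, "wav2vec 2.0 expects 16 kHz audio"
    source = torch.tensor(wav, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        # in recent fairseq, extract_features returns a dict; "x" holds the
        # (frames, dim) contextual features
        out = w2v2.extract_features(source, padding_mask=None)
    return out["x"].squeeze(0).numpy()

# 1) fit the codebook on features pooled from a training set
train_files = ["utt1.wav", "utt2.wav"]   # placeholder file list
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, batch_size=4096).fit(
    np.concatenate([features(f) for f in train_files], axis=0)
)

# 2) every utterance then becomes a sequence of unit ids
units = kmeans.predict(features("utt1.wav"))
```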

siyan-sylvia-li commented 2 years ago

Hello, I have two questions:

  1. We are thinking about training a unit2speech HiFi-GAN, but could not find any training details in the speech-resynthesis paper. Roughly how many GPUs / hours did it take to train a HiFi-GAN?
  2. Can you provide more detailed instructions on how we can use wav2vec 2.0 to quantize and then train? I see configuration files in the speech-resynthesis repo that use other quantization models, including HuBERT and CPC, but it is still not clear to me how I should adapt the existing config files to use my own quantization models to potentially encode different datasets.

Thank you very much for your time!

adiyoss commented 2 years ago

Hi @siyan-sylvia-li,

  1. We train our model on 8 GPUs for 400K iterations; you can see the details in the code: https://github.com/facebookresearch/speech-resynthesis. Training on fewer GPUs should also work, but it will probably be slower to converge.
  2. You need to replace the tokens extracted from HuBERT/CPC with tokens extracted from wav2vec 2.0. You should first extract the units for the VCTK corpus from here: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit. Then, train your vocoder with these units. You can use this repo for that: https://github.com/facebookresearch/speech-resynthesis
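
Putting those two steps together, here is a rough sketch of dumping one unit sequence per VCTK utterance so the vocoder training code has something to consume. The `quantize` helper stands in for the wav2vec 2.0 + k-means step sketched earlier in this thread, and the output line format here is only a placeholder; the actual manifest layout expected by the speech-resynthesis dataloader should be copied from that repo's example datasets.

```python
# Sketch: dump one discrete-unit sequence per VCTK utterance for vocoder training.
# `quantize` is a hypothetical wav2vec 2.0 + k-means pipeline (see earlier sketch);
# the output line format is a placeholder and must be adapted to the manifest
# format that the speech-resynthesis dataloader actually expects.
from pathlib import Path

VCTK_ROOT = Path("/data/VCTK-Corpus/wav16")   # placeholder: 16 kHz VCTK audio
OUT_MANIFEST = Path("vctk_w2v2_units.txt")

def quantize(wav_path):
    """Placeholder: return a list of discrete unit ids for one utterance."""
    raise NotImplementedError("plug in the wav2vec 2.0 feature + k-means step here")

with OUT_MANIFEST.open("w") as out:
    for wav_path in sorted(VCTK_ROOT.rglob("*.wav")):
        units = quantize(wav_path)
        # one utterance per line: audio path, then space-separated unit ids
        out.write(f"{wav_path}|{' '.join(map(str, units))}\n")
```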