Open indiejoseph opened 1 month ago
@tuanh208 Could you share some insights for this question? Thanks!
Hi, I think the reason the output has a strong accent in Chinese is that we only trained the HiFi-GAN vocoder on Expresso (which is in English).
For the pitch tokenizer, as mentioned in the paper, we trained a VQ-VAE model on the extracted f0 (you can use any f0 extractor in this repo), following this work: https://github.com/facebookresearch/speech-resynthesis?tab=readme-ov-file#f0-quantizer-model
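To make the pitch-tokenizer idea concrete, here is a minimal numpy sketch of the same pipeline: extract f0, then learn a small codebook over voiced log-f0 and map each frame to a code index. This is a toy 1-D k-means quantizer standing in for the actual VQ-VAE f0 quantizer from speech-resynthesis; the function names and the "unvoiced frames get a reserved token" convention are my own assumptions, not the repo's API.

```python
import numpy as np

def train_f0_codebook(f0, n_codes=32, iters=20):
    # Toy stand-in for the VQ-VAE f0 quantizer: 1-D k-means over
    # voiced log-f0 values, initialized at quantiles for determinism.
    x = np.log(f0[f0 > 0])
    codes = np.quantile(x, np.linspace(0.0, 1.0, n_codes))
    for _ in range(iters):
        assign = np.abs(x[:, None] - codes[None, :]).argmin(axis=1)
        for k in range(n_codes):
            if np.any(assign == k):
                codes[k] = x[assign == k].mean()
    return np.sort(codes)

def tokenize_f0(f0, codes):
    # Map each frame to its nearest code index; unvoiced frames
    # (f0 == 0) get a dedicated token id, len(codes). This reserved-id
    # convention is an assumption for illustration.
    toks = np.full(len(f0), len(codes), dtype=int)
    voiced = f0 > 0
    toks[voiced] = np.abs(
        np.log(f0[voiced])[:, None] - codes[None, :]
    ).argmin(axis=1)
    return toks

# Example: 10 unvoiced frames, then two flat pitch plateaus.
f0 = np.concatenate([np.zeros(10), np.full(50, 100.0), np.full(50, 200.0)])
codes = train_f0_codebook(f0, n_codes=2)
tokens = tokenize_f0(f0, codes)
```

In the real setup you would feed the frame-level f0 track from your extractor of choice and train the VQ-VAE codebook jointly with its encoder/decoder, but the frame-to-discrete-token mapping is the same idea.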
For the style tokenizer, we initially fine-tuned Speechprop (from this work: https://ai.meta.com/research/publications/sonar-expressive-zero-shot-expressive-speech-to-speech-translation/) to predict styles on the Expresso dataset, and trained a k-means tokenizer on the features extracted from Speechprop. For this release, however, we distilled a smaller wav2vec2 model to predict the tokens produced by Speechprop, which turned out to work reasonably well. So if you want to train a new style tokenizer, I would suggest fine-tuning a good speech encoder (e.g. w2v2, WavLM) on some expressive datasets with style labels; that should work well.
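The k-means-tokenizer step above can be sketched in a few lines of numpy: cluster frame-level encoder features (e.g. wav2vec2 or WavLM hidden states), and each frame's cluster id becomes its style token. This is a minimal, self-contained illustration, not the released tokenizer; the function names, the farthest-point initialization, and the synthetic 4-dimensional "features" are assumptions for the example.

```python
import numpy as np

def kmeans(feats, n_clusters=8, iters=25):
    # Plain Lloyd k-means with deterministic farthest-point init.
    # In the real setup, `feats` would be frame-level hidden states
    # from a fine-tuned speech encoder (assumption: any frame-level
    # float feature matrix of shape [n_frames, dim] works here).
    centers = feats[[0]].astype(float)
    for _ in range(n_clusters - 1):
        # Distance from every frame to its nearest existing center.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, feats[d.argmax()]])
    for _ in range(iters):
        assign = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = feats[assign == k].mean(0)
    return centers

def style_tokens(feats, centers):
    # Each frame's style token is the id of its nearest cluster center.
    return ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)

# Example with two synthetic "styles" as well-separated feature clusters.
rng = np.random.default_rng(0)
feats = np.vstack([
    rng.normal(0.0, 0.1, size=(40, 4)),  # pretend style A frames
    rng.normal(5.0, 0.1, size=(40, 4)),  # pretend style B frames
])
centers = kmeans(feats, n_clusters=2)
tokens = style_tokens(feats, centers)
```

The distillation step then simply trains a smaller encoder with these cluster ids as frame-level classification targets, which is why a well-chosen teacher encoder matters more than the clustering itself.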
This project demonstrates a very good way to tokenize speech with different features, such as style and pitch tokens, enabling downstream applications to have fine-grained control over the generated voice.
I've tested the speech tokenizer on Cantonese, and the output has a very strong accent, probably because the training dataset only contains English. I was wondering how I can train the tokenizer myself. I know fairseq has HuBERT and HiFi-GAN training recipes, but I'm not sure how to go about the pitch and style features.