Open indiejoseph opened 1 month ago
@tuanh208 Could you share some insights for this question? Thanks!
Hi, I think the reason the output has a strong accent in Chinese is that we only trained the HiFi-GAN vocoder on Expresso (which is in English).
For the pitch tokenizer, as mentioned in the paper, we trained a VQ-VAE model on the extracted f0 (you can use any f0 extractor in this repo), following this work: https://github.com/facebookresearch/speech-resynthesis?tab=readme-ov-file#f0-quantizer-model
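To make the pitch-tokenizer idea concrete, here is a minimal numpy sketch of the same pipeline: extract f0, then learn a small codebook over voiced log-f0 and map each frame to a code index. This is a toy 1-D k-means quantizer standing in for the actual VQ-VAE f0 quantizer from speech-resynthesis; the function names and the "unvoiced frames get a reserved token" convention are my own assumptions, not the repo's API.

```python
import numpy as np

def train_f0_codebook(f0, n_codes=32, iters=20):
    # Toy stand-in for the VQ-VAE f0 quantizer: 1-D k-means over
    # voiced log-f0 values, initialized at quantiles for determinism.
    x = np.log(f0[f0 > 0])
    codes = np.quantile(x, np.linspace(0.0, 1.0, n_codes))
    for _ in range(iters):
        assign = np.abs(x[:, None] - codes[None, :]).argmin(axis=1)
        for k in range(n_codes):
            if np.any(assign == k):
                codes[k] = x[assign == k].mean()
    return np.sort(codes)

def tokenize_f0(f0, codes):
    # Map each frame to its nearest code index; unvoiced frames
    # (f0 == 0) get a dedicated token id, len(codes). This reserved-id
    # convention is an assumption for illustration.
    toks = np.full(len(f0), len(codes), dtype=int)
    voiced = f0 > 0
    toks[voiced] = np.abs(
        np.log(f0[voiced])[:, None] - codes[None, :]
    ).argmin(axis=1)
    return toks

# Example: 10 unvoiced frames, then two flat pitch plateaus.
f0 = np.concatenate([np.zeros(10), np.full(50, 100.0), np.full(50, 200.0)])
codes = train_f0_codebook(f0, n_codes=2)
tokens = tokenize_f0(f0, codes)
```

In the real setup you would feed the frame-level f0 track from your extractor of choice and train the VQ-VAE codebook jointly with its encoder/decoder, but the frame-to-discrete-token mapping is the same idea.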
For the style tokenizer, we initially fine-tuned Speechprop (from this work: https://ai.meta.com/research/publications/sonar-expressive-zero-shot-expressive-speech-to-speech-translation/) to predict styles on the Expresso dataset, and trained a k-means tokenizer on the features extracted from Speechprop. For this release, however, we distilled a smaller wav2vec2 model to predict the tokens produced by Speechprop, which turned out to work reasonably well. So if you want to train a new style tokenizer, I would suggest fine-tuning a good speech encoder (e.g. w2v2, WavLM) on some expressive datasets with style labels; that should work well.
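The k-means-tokenizer step above can be sketched in a few lines of numpy: cluster frame-level encoder features (e.g. wav2vec2 or WavLM hidden states), and each frame's cluster id becomes its style token. This is a minimal, self-contained illustration, not the released tokenizer; the function names, the farthest-point initialization, and the synthetic 4-dimensional "features" are assumptions for the example.

```python
import numpy as np

def kmeans(feats, n_clusters=8, iters=25):
    # Plain Lloyd k-means with deterministic farthest-point init.
    # In the real setup, `feats` would be frame-level hidden states
    # from a fine-tuned speech encoder (assumption: any frame-level
    # float feature matrix of shape [n_frames, dim] works here).
    centers = feats[[0]].astype(float)
    for _ in range(n_clusters - 1):
        # Distance from every frame to its nearest existing center.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, feats[d.argmax()]])
    for _ in range(iters):
        assign = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = feats[assign == k].mean(0)
    return centers

def style_tokens(feats, centers):
    # Each frame's style token is the id of its nearest cluster center.
    return ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)

# Example with two synthetic "styles" as well-separated feature clusters.
rng = np.random.default_rng(0)
feats = np.vstack([
    rng.normal(0.0, 0.1, size=(40, 4)),  # pretend style A frames
    rng.normal(5.0, 0.1, size=(40, 4)),  # pretend style B frames
])
centers = kmeans(feats, n_clusters=2)
tokens = style_tokens(feats, centers)
```

The distillation step then simply trains a smaller encoder with these cluster ids as frame-level classification targets, which is why a well-chosen teacher encoder matters more than the clustering itself.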
This project demonstrates a very good way to tokenize speech with different features, such as style and pitch tokens, enabling downstream applications to have fine-grained control over the generated voice.
I've tested the speech tokenizer on Cantonese, and the output has a very strong accent, probably because the training dataset only contains English. I was wondering how I can train the tokenizer myself. I know fairseq has HuBERT and HiFi-GAN training recipes, but I'm not sure how to go about the pitch and style features.