X-LANCE / VoiceFlow-TTS

[ICASSP 2024] This is the official code for "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching"
https://cantabile-kwok.github.io/VoiceFlow/

Reflow failure #17

Open NathanWalt opened 1 week ago

NathanWalt commented 1 week ago

I've been trying to reproduce your work, especially the rectified flow part. However, the reflow procedure always results in poorer synthesis quality (even with few sampling steps). Could you share some of the hyperparameters used in the reflow procedure, such as the number of training epochs and the EMA decay rate?

cantabile-kwok commented 1 week ago

Hi, most of the hyperparameters are specified in configs/lj_16k_gt_dur_reflow.yaml. As for the EMA decay rate, it is hardcoded to 0.9999 and we did not change it. For the training epochs, if I remember correctly, we trained for up to 400 epochs on LJSpeech and 100 epochs on LibriTTS. But I don't think that much is strictly necessary, and I cannot say what the minimum number of epochs is to reach decent quality.
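
For reference, a minimal sketch of how such an EMA of the model weights is typically maintained; the 0.9999 decay matches the hardcoded value mentioned above, but the `init_ema`/`update_ema` helpers and names here are illustrative, not the repository's actual code:

```python
import copy
import torch

def init_ema(model: torch.nn.Module) -> torch.nn.Module:
    # Shadow copy of the model whose weights will track an exponential moving average.
    ema_model = copy.deepcopy(model)
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return ema_model

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.9999):
    # ema <- decay * ema + (1 - decay) * current weights, applied after each optimizer step.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```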

Could you specify which dataset you are using when the flow fails to rectify?

NathanWalt commented 1 week ago

I'm also working on the LJSpeech dataset, but I'm implementing the algorithm on top of Grad-TTS's framework and data-preprocessing code, and I use the original 22.05 kHz sampling rate. Unexpectedly, the model collapsed after about 50 epochs of rectification. Did you keep generating multiple noise-mel pairs for each utterance during rectification?

cantabile-kwok commented 1 week ago

This sounds strange to me, as I have never experienced such problems before.

> Did you keep generating multiple noise-mel pairs for each utterance during rectification?

No, I just generated a new dataset of equal size, with one generated utterance per sentence. Those generated samples were then kept fixed (as if they were an off-the-shelf dataset); no re-generation was performed.
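
To make that concrete, a rough sketch of how such a one-shot reflow dataset could be built: for every utterance, draw a single noise tensor, run the trained flow's ODE sampler once to get the paired mel, and save both so the (noise, mel) pair never changes during reflow training. The `model.sample_ode` call, the item fields, and the file layout are assumptions for illustration, not the repository's actual API:

```python
import torch
from pathlib import Path

@torch.no_grad()
def build_reflow_dataset(model, dataset, out_dir: str, n_steps: int = 100):
    # One fixed (noise, mel) pair per utterance; nothing is re-generated later.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for item in dataset:  # item: dict with an "id", the target "mel", and conditioning "cond"
        noise = torch.randn_like(item["mel"])  # z ~ N(0, I), same shape as the target mel
        mel = model.sample_ode(noise, item["cond"], n_steps=n_steps)  # hypothetical ODE sampler
        torch.save({"noise": noise.cpu(), "mel": mel.cpu()}, out / f'{item["id"]}.pt')
```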