NathanWalt opened this issue 1 week ago

I've been trying to reproduce your work, especially the rectified flow part. However, the reflow procedure always results in poorer synthesis quality (even at small sampling-step counts). I'm wondering if you could share some of the hyperparameters used in the reflow procedure, such as the number of training epochs and the EMA decay rate?
Hi, most of the hyperparameters are specified in configs/lj_16k_gt_dur_reflow.yaml. As for the EMA decay rate, it is hardcoded to 0.9999 and we did not change it. As for the training epochs, if I remember correctly, we trained for up to 400 epochs on LJSpeech and 100 epochs on LibriTTS. But I don't think that much is strictly necessary, and I can't say how many epochs are the minimum needed to reach decent quality.
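For context, a decay of 0.9999 refers to the usual shadow-weight EMA update over model parameters. A minimal sketch of that scheme is below; the class and method names are illustrative, not the repo's actual implementation:

```python
import copy
import torch

class EMA:
    """Exponential moving average of model parameters (minimal sketch)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        # The shadow copy holds the averaged weights used at inference time.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```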
Can you specify which dataset you are using, i.e. the one on which you are unable to rectify the flow?
I'm also working on the LJSpeech dataset, but I'm implementing the algorithm on top of Grad-TTS's framework and data-preprocessing code, and I use the original 22.05 kHz sampling rate. Unexpectedly, the model collapsed after about 50 epochs of rectification (my rectification step is sketched below). Did you keep generating multiple noise-mel pairs for each utterance during rectification?
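For reference, my rectification step is essentially the standard rectified-flow objective, roughly like this (velocity_fn stands in for my text-conditioned flow model; conditioning and sequence masking are omitted here):

```python
import torch
import torch.nn.functional as F

def reflow_loss(velocity_fn, z0, x1):
    """One rectification step on a frozen (noise, mel) pair.

    z0: noise endpoint, x1: generated mel endpoint, both (B, n_mels, T).
    The model's velocity at a random point on the straight line between
    z0 and x1 is regressed onto the line's constant velocity (x1 - z0).
    """
    t = torch.rand(z0.size(0), 1, 1)   # one random time per pair in the batch
    x_t = (1.0 - t) * z0 + t * x1      # point on the straight path at time t
    target = x1 - z0                   # constant velocity along that path
    return F.mse_loss(velocity_fn(x_t, t), target)
```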
This sounds weird to me, as I have never experienced such problems before.
> Did you keep generating multiple noise-mel pairs for each utterance during rectification?
No, I just generated a new dataset of equal size: one generated utterance for each sentence. Those generated samples were then kept fixed (as if they were an off-the-shelf dataset); no re-generation was performed.
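So pair generation happens exactly once. A minimal sketch of that one-shot generation, assuming a plain Euler sampler over the learned velocity field (velocity_fn, mel_shape, and the step count are illustrative placeholders, not our actual code):

```python
import torch

@torch.no_grad()
def generate_pairs(velocity_fn, num_utts, mel_shape=(80, 400), n_steps=100):
    """Generate one frozen (noise, mel) pair per utterance for reflow."""
    pairs = []
    dt = 1.0 / n_steps
    for _ in range(num_utts):
        z0 = torch.randn(1, *mel_shape)    # keep the noise endpoint
        x = z0.clone()
        for i in range(n_steps):           # Euler solve of dx/dt = v(x, t)
            t = torch.full((1, 1, 1), i * dt)
            x = x + velocity_fn(x, t) * dt
        pairs.append((z0, x))              # frozen (z0, x1) pair, never redrawn
    return pairs
```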