Influence of `sigma_t` in `loss_t`

X-LANCE / VoiceFlow-TTS

[ICASSP 2024] This is the official code for "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching"


Hi, thanks for your great work. I noticed that you add a small Gaussian noise when sampling `x_t` in CFM:

mu_t = t_unsqueeze * x1 + (1 - t_unsqueeze) * x0  # conditional Gaussian mean
sigma_t = self.sigma_min
x = mu_t + sigma_t * torch.randn_like(x1)  # sample p_t(x|x_0, x_1)

This matches the description in your paper. However, most other work on rectified flow does not use this `sigma_t` and simply takes the mean `mu_t` as the sampled `x_t`. Have you explored how much influence `sigma_t` has on model performance, and what range of values is appropriate for it? Thanks in advance for your help.

Thank you for the question; I think it is a really good one. Honestly, I don't have a solid answer: the conditional flow matching paper (https://arxiv.org/abs/2302.00482v1) includes the small $\sigma$, but rectified flow (https://arxiv.org/abs/2209.03003) has no such $\sigma$. From a mathematical perspective, the boundary conditions of the conditional probability path $p(x_t|x_1,x_0)$ require $\sigma$ to be small, and setting it to 0 does not seem to have much theoretical impact.
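For reference, the relationship between the two formulations can be written out as follows. This is a sketch in the notation of the code snippet above, assuming the independent-coupling form of conditional flow matching with a constant $\sigma$; it is not taken from the VoiceFlow paper itself:

$$
p_t(x \mid x_0, x_1) = \mathcal{N}\big(x;\ t\,x_1 + (1-t)\,x_0,\ \sigma^2 I\big), \qquad u_t(x \mid x_0, x_1) = x_1 - x_0 .
$$

At $t=0$ and $t=1$ the path is a Gaussian of width $\sigma$ centered on $x_0$ and $x_1$ respectively, so it only concentrates on the endpoints when $\sigma$ is small. In the limit $\sigma \to 0$ the Gaussian collapses to a point mass at $x_t = t\,x_1 + (1-t)\,x_0$, which is the deterministic interpolation used in rectified flow.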

Personally, I have not investigated this tricky $\sigma$, but some time ago someone told me that setting $\sigma$ to a smaller value than the current one "seemed to lead to worse performance". If this observation holds, then my guess is that a non-zero $\sigma$ helps to "smooth" the flow matching trajectory, so that the model learns not only on the line between $x_1$ and $x_0$ but also in the nearby regions. This is just an intuition, though, and more empirical evidence is needed to verify it.
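One way to gather that evidence would be to make $\sigma$ an explicit argument and sweep it. Below is a minimal sketch of such a setup, not the repository's actual training code: `sample_xt`, `cfm_loss`, and `model` are hypothetical names, and the regression target `x1 - x0` is an assumption following the conditional flow matching formulation. The loop at the end only shows how far the training inputs stray from the straight line for each $\sigma$.

```python
import torch


def sample_xt(x0, x1, t, sigma):
    """Draw x_t ~ N(t*x1 + (1-t)*x0, sigma^2 I).

    With sigma == 0 this reduces to the deterministic interpolation
    used in rectified flow.
    """
    # Reshape t from (B,) to (B, 1, ..., 1) so it broadcasts over x1.
    t = t.view(-1, *([1] * (x1.dim() - 1)))
    mu_t = t * x1 + (1 - t) * x0
    return mu_t + sigma * torch.randn_like(x1)


def cfm_loss(model, x0, x1, sigma):
    """Flow matching loss for a hypothetical model(x_t, t) -> velocity."""
    t = torch.rand(x1.shape[0], device=x1.device)
    x_t = sample_xt(x0, x1, t, sigma)
    u_t = x1 - x0  # assumed regression target (straight-line velocity)
    return torch.mean((model(x_t, t) - u_t) ** 2)


# How far do the training inputs move off the line for different sigma?
torch.manual_seed(0)
x0 = torch.randn(4096, 8)  # toy noise samples
x1 = torch.randn(4096, 8)  # toy data samples
t = torch.rand(4096)
on_line = sample_xt(x0, x1, t, sigma=0.0)
for sigma in (0.0, 1e-4, 1e-2, 1e-1):
    offset = sample_xt(x0, x1, t, sigma) - on_line
    print(f"sigma={sigma:g}  empirical std of offset={offset.std().item():.4g}")
```

Training one model per value of `sigma` with `cfm_loss` and comparing the resulting sample quality would then test whether the "smoothing" intuition holds in practice.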
