X-LANCE / VoiceFlow-TTS

[ICASSP 2024] This is the official code for "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching"
https://cantabile-kwok.github.io/VoiceFlow/

Discussion regarding CFM's density calculation #7

Closed: p0p4k closed this issue 6 months ago

p0p4k commented 9 months ago

Is it possible to calculate the density like normalizing flows do, and then use it as a KL divergence (like VITS) for end-to-end training? (I saw the easter eggs, just wanna know your thoughts about this.)

cantabile-kwok commented 9 months ago

Calculating the density itself is theoretically possible, but in practice I am not sure it can be applied to end-to-end training. Density calculation is theoretically possible because flow matching is essentially an ODE, and the likelihood of an ODE can be computed using the instantaneous change-of-variable formula (as shown in the "easter eggs"). But if you look at the details of this calculation, the ODE has to be simulated for some number of steps, and each step introduces randomness through the Skilling-Hutchinson estimator (we need to sample noise to estimate the divergence of the velocity field; see the sketch after the list below). This process is hard to apply to end-to-end training because:

  1. It is very costly in time. Calculating the density this way is highly similar to sampling from the ODE at inference time, which means you have to call the velocity-field estimator at least several times to obtain a reasonable density estimate. This will substantially increase training time.
  2. The gradient behind this process is too complex. For training you must preserve the gradients, and note that the ODE simulation is an iterative procedure. This means gradients will be back-propagated through the sampling steps, like in an RNN. Plus, since the calculation involves randomness, training might be even harder.
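To make both points concrete, here is a minimal sketch of this likelihood evaluation (not from the VoiceFlow codebase; `velocity_net` is a hypothetical estimator of $v_\theta(x, t)$, and a standard-Gaussian prior at $t=0$ is assumed):

```python
import math
import torch

def flow_log_likelihood(velocity_net, x1, n_steps=32):
    """Estimate log p(x1) for a flow-matching ODE dx/dt = v(x, t) that maps
    noise x0 ~ N(0, I) at t=0 to data x1 at t=1.

    Integrates the ODE backwards from t=1 to t=0 with Euler steps, accumulating
    the Skilling-Hutchinson divergence estimate at each step. Each step calls
    velocity_net once (point 1 above), and to use this as a training loss you
    would have to keep the whole autograd graph (create_graph=True, no detach),
    i.e. back-propagate through every step like an RNN (point 2 above).
    """
    x = x1.clone().requires_grad_(True)
    dt = 1.0 / n_steps
    div_int = torch.zeros(x.shape[0], device=x.device)  # approximates \int_0^1 div(v) dt
    for i in range(n_steps):
        t = torch.full((x.shape[0],), 1.0 - i * dt, device=x.device)
        with torch.enable_grad():
            v = velocity_net(x, t)
            eps = torch.randn_like(x)  # Hutchinson probe vector
            # eps^T (dv/dx) via a vector-Jacobian product, then eps^T (dv/dx) eps
            vjp = torch.autograd.grad(v, x, grad_outputs=eps)[0]
        div_int = div_int + (vjp * eps).flatten(1).sum(dim=1) * dt
        x = (x - v * dt).detach().requires_grad_(True)  # Euler step, t -> t - dt
    # instantaneous change of variables: log p1(x1) = log p0(x0) - \int_0^1 div(v) dt
    d = x1[0].numel()
    log_p0 = -0.5 * x.detach().flatten(1).pow(2).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    return log_p0 - div_int
```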

So, although flow matching and normalizing flows are both "flows", they model the generation process differently. A normalizing flow can be thought of as a one-step alternative to that instantaneous change-of-variable formula, so there is no iterative computation and no Skilling-Hutchinson estimation of the divergence. This difference is what matters for realizing such end-to-end training.
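For concreteness, these are the two standard likelihood formulas being contrasted (notation mine, not from the VoiceFlow paper). A normalizing flow with an invertible map $f$ evaluates the density in one step,

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|,$$

while for a flow-matching ODE $\dot{x}_t = v_\theta(x_t, t)$ the instantaneous change-of-variable formula requires an integral along the trajectory, with the divergence typically approximated by the Skilling-Hutchinson estimator:

$$\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot v_\theta(x_t, t)\, dt, \qquad \nabla \cdot v \approx \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\epsilon^\top \frac{\partial v}{\partial x}\, \epsilon\right].$$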

But I think end-to-end training with flow matching is still feasible. It would look more like end-to-end diffusion models. One can still construct a VAE for encoding and decoding the speech data and, meanwhile, treat the encoded latent variable as $x_1$ and sample $x_0$ from a text-related distribution to train a flow matching model between the two. That is a more natural way to train a flow matching model in the end-to-end setup.
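As a rough sketch of that setup (all module names and signatures are hypothetical, and loss weights are omitted), a single training step could look like:

```python
import torch
import torch.nn.functional as F

def e2e_training_step(vae, text_encoder, velocity_net, speech, text):
    """One end-to-end step: VAE reconstruction + flow matching in latent space.

    vae(speech) is assumed to return a sampled latent plus its reconstruction
    and KL losses; text_encoder(text) is assumed to return a mean with the
    same shape as the latent. All three modules receive gradients from a
    single backward pass.
    """
    # treat the VAE latent as x1 (the "data" end of the flow)
    x1, recon_loss, kl_loss = vae(speech)
    # sample x0 from a text-related distribution (here: Gaussian around the text encoding)
    mu = text_encoder(text)
    x0 = mu + torch.randn_like(mu)
    # rectified-flow path: linear interpolation, constant target velocity x1 - x0
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1
    cfm_loss = F.mse_loss(velocity_net(xt, t.flatten(), mu), x1 - x0)
    return recon_loss + kl_loss + cfm_loss
```

Note that, unlike the density-based approach above, this needs no ODE simulation during training: the flow matching loss is a simple regression on the straight path between $x_0$ and $x_1$.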

p0p4k commented 9 months ago

Right, like NaturalSpeech 2. So two-stage training would be the answer, I guess. Thanks for your insight.