Closed p0p4k closed 6 months ago
Calculating the density itself is theoretically possible, but in practice I am not sure it can be applied to end2end training. It is possible in theory because flow matching is essentially an ODE, and the likelihood under an ODE can be computed with the instantaneous change-of-variables formula (as shown in the "easter eggs"). But if you look at the details of this calculation, the ODE actually has to be simulated for some number of steps, and each step introduces randomness through the Skilling-Hutchinson estimator (we need to sample noise to estimate the divergence of the velocity field). This process is hard to apply to end2end training because the iterative simulation is expensive to backpropagate through, and the divergence estimate is stochastic, so the resulting likelihood (and its gradient) is noisy.
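To make the stochastic part concrete, here is a minimal numpy sketch of the Skilling-Hutchinson trace estimator on a toy linear velocity field $v(x) = Ax$, where the divergence is exactly $\mathrm{tr}(A)$ (the matrix `A` and the function names are illustrative, not from any real codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear velocity field v(x) = A @ x; its divergence is trace(A).
A = rng.standard_normal((4, 4))

def divergence_exact():
    return np.trace(A)

def divergence_hutchinson(n_samples=10000):
    # Skilling-Hutchinson: E[eps^T J eps] = trace(J) for noise eps with
    # zero mean and identity covariance. For v(x) = A x the Jacobian J
    # is just A, so eps^T (A eps) is one stochastic estimate.
    eps = rng.standard_normal((n_samples, 4))
    return np.mean(np.einsum("ni,ij,nj->n", eps, A, eps))

print(divergence_exact(), divergence_hutchinson())
```

The point is that every ODE step draws fresh `eps`, so the log-likelihood coming out of the change-of-variables integral is itself a noisy Monte Carlo estimate, which is what makes using it as a training loss awkward.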
So, although flow matching and normalizing flows are both "flows", they model the generation process differently. A normalizing flow can be thought of as a one-step alternative to that instantaneous change-of-variables formula, so there is no iterative computation and no Skilling-Hutchinson estimation of the divergence. This difference matters for whether such end2end training is practical.
But I think end2end training with flow matching is also feasible; it would look more like end2end diffusion models. One can still construct a VAE to encode and decode the speech data, then treat the encoded latent variable as $x_1$, sample $x_0$ from a text-related distribution, and train a flow matching model between the two. This is a more natural way to train a flow matching model in the end2end setup.
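A minimal numpy sketch of that conditional flow matching objective between a text-side sample $x_0$ and a VAE latent $x_1$ (all names here are hypothetical; a real model would be a neural network trained with SGD):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, x0, x1, t):
    # Linear interpolation path between the text-side sample x0 and
    # the VAE latent x1; the regression target along this path is the
    # constant velocity x1 - x0 (conditional flow matching).
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = v_theta(x_t, t)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

x0 = rng.standard_normal((8, 2))  # stand-in for text-conditioned prior samples
x1 = rng.standard_normal((8, 2))  # stand-in for VAE-encoded speech latents
t = rng.uniform(size=8)           # random times in [0, 1]

# Dummy "model" that predicts the true velocity, so the loss is zero.
perfect = lambda x_t, t: x1 - x0
print(cfm_loss(perfect, x0, x1, t))  # -> 0.0
```

Since this loss is simulation-free (no ODE solve, no divergence estimate), it composes cleanly with the VAE reconstruction loss in a joint end2end objective.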
Right, like naturalspeech2. So two-stage training would be the answer, I guess. Thanks for your insight.
Is it possible to calculate the density like normalizing flows do, and then use it in a KL divergence (like VITS) for end2end training? (I saw the easter eggs, just wanted to know your thoughts on this.)