cagatayyildiz / ODE2VAE

ODE2VAE: Deep generative second order ODEs with Bayesian neural networks
MIT License
124 stars 27 forks

A question about the loss function in paper. #3

Closed y18810919727 closed 3 years ago

y18810919727 commented 3 years ago

Hi, I am an ML researcher working on sequential VAE models. ODE2VAE interests me a lot, but I have a question about the ELBO loss in Eq. (16) that is unrelated to the code. When the posterior model is built from a second-order ODE-net, the joint posterior log-probability of a sampled hidden-state sequence can be expressed as

log q(z_0, z_1, ..., z_N) = log q(z_0) + log q(z_1 | z_0) + log q(z_2 | z_0) + ... + log q(z_N | z_0) = log q(z_0),

because the deterministic ODE system makes every conditional q(z_i | z_0) a delta distribution: the probability mass at the ODE solution is 1, so each conditional log-term is zero.

Therefore, the KL divergence of the posterior from the prior, KL[q(z_0, z_1, ..., z_N) || p(z_0, z_1, ..., z_N)], should not contain any terms for z_i with i > 0.
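In equations, my intuition is roughly the following (writing Φ_i for the deterministic ODE solution map, which is just my notation for the argument above):

```latex
% Deterministic flow: z_i = \Phi_i(z_0), so the joint posterior factorises as
q(z_0, z_1, \dots, z_N) \;=\; q(z_0)\,\prod_{i=1}^{N} \delta\big(z_i - \Phi_i(z_0)\big).
% On this reading, the KL term of the ELBO would reduce to the initial state alone:
\mathrm{KL}\big[\,q(z_0,\dots,z_N)\;\big\|\;p(z_0,\dots,z_N)\,\big]
  \;\overset{?}{\longrightarrow}\;
  \mathrm{KL}\big[\,q(z_0)\;\big\|\;p(z_0)\,\big].
```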

I think the ELBO in Appendix E of 'Neural Ordinary Differential Equations' (Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D., NeurIPS 2018) is the more rational one.

In the ODE2VAE paper, by contrast, the KL terms are computed over all hidden states by introducing a marginal log-density log q(z_i) for each hidden state z_i. How should I understand this loss function, which goes against my intuition? It would be kind of you to answer my question.

Thanks a lot !

cagatayyildiz commented 3 years ago

I agree that given a fixed initial value z_0, q(z_i | z_0) is a delta distribution (following an ODE flow). The catch is that z_0 itself is a random variable. In other words, the flow transforms the initial density q(z_0) into some other marginal density q(z_i), which we penalize (via the KL term) to prevent it from collapsing. Maybe the confusing part is that the procedure is particle-based (so it seems there is no need to compute the density); however, each particle is a sample from the marginal q(z_i). Then, for a random variable z ~ p(z), x = f(z) would follow eq. 3 here.
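To make the particle view concrete, here is a minimal toy sketch (not the actual ODE2VAE implementation; the drift f, the dimensionality, and the explicit Euler integrator are made up for illustration). Each particle is integrated jointly with its log-density via the instantaneous change-of-variables formula d log q(z_t)/dt = -tr(df/dz), so the marginal log q(z_i) is available at every time point for the KL term:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                            # toy latent dimensionality
A = 0.1 * rng.standard_normal((D, D))

def f(z):
    # Made-up drift standing in for the learned dynamics (ODE2VAE itself uses
    # a second-order flow parameterized by a Bayesian neural network).
    return A @ np.tanh(z)

def trace_jac_f(z):
    # Exact trace of df/dz for f(z) = A tanh(z): tr(A diag(1 - tanh(z)^2)).
    return float(np.sum(np.diag(A) * (1.0 - np.tanh(z) ** 2)))

def flow_with_log_density(z0, log_q0, dt=0.01, steps=100):
    # Euler-integrate dz/dt = f(z) together with dlog q/dt = -tr(df/dz),
    # so the particle carries its own marginal log-density along the flow.
    z, log_q = z0.copy(), log_q0
    for _ in range(steps):
        log_q -= trace_jac_f(z) * dt
        z = z + f(z) * dt
    return z, log_q

# One particle drawn from the encoder posterior q(z_0) = N(mu, sigma^2 I).
mu, sigma = np.zeros(D), 1.0
z0 = mu + sigma * rng.standard_normal(D)
log_q0 = (-0.5 * np.sum(((z0 - mu) / sigma) ** 2)
          - D * np.log(sigma) - 0.5 * D * np.log(2.0 * np.pi))

zT, log_qT = flow_with_log_density(z0, log_q0)
print(zT, log_qT)   # log q(z_T) for this particle, usable inside a KL estimate
```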

y18810919727 commented 3 years ago

Yeah, thanks for your response. I have got it, though maybe I found a more rigorous explanation. Consider two posterior distributions over hidden sequences, Q(z_0, z_1, ..., z_N) and O(z_0, z_1, ..., z_N). Distribution Q is independent across positions i, with joint log-density log Q(z_0, z_1, ...) = log q(z_0) + log q(z_1) + ... + log q(z_N); it is the posterior distribution we actually want to optimize. Distribution O is not independent across positions: its joint log-density is log O(z_0, z_1, ...) = log q(z_0) + log q(z_1 | z_0) + ... + log q(z_N | z_0) = log q(z_0); it corresponds to the distribution from which we actually obtain z_0, z_1, z_2, ... by sampling z_0 and solving the ODE system.

We define the hidden sequence Z = {z_0, z_1, ..., z_N} and the Gaussian prior distribution P(Z). The optimization target KL(Q(Z) || P(Z)) equals the expectation of log[Q(Z)/P(Z)] with Z sampled from Q(Z). Interestingly, because both Q and P are independent across indices, log Q(Z) = log q(z_0) + log q(z_1) + ... + log q(z_N) is a sum of terms that each depend on a single z_i, and Q and O share the same marginals q(z_i); hence the expectation of log Q(Z) with Z sampled from Q(Z) is equivalent to the expectation of log Q(Z) with Z sampled from O(Z) (a quick numerical check is below). The same is true for log P(Z). Obviously, sampling from O(Z) is much more efficient and direct than sampling from Q(Z).
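Here is that check with a made-up linear flow z_i = c_i * z_0, z_0 ~ N(0, 1) (nothing to do with the real ODE2VAE model): because sum_i log q(z_i) depends on one index at a time and O and Q share the same marginals N(0, c_i^2), the two Monte Carlo estimates coincide up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)
N, S = 5, 200_000                    # number of time steps, Monte Carlo samples
c = 1.0 + 0.3 * np.arange(N)         # toy deterministic flow: z_i = c_i * z_0

def log_normal(x, std):
    # log N(x; 0, std^2), evaluated elementwise
    return -0.5 * (x / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

# O: sample z_0 ~ N(0, 1) once and push it through the flow (coupled across i).
z0 = rng.standard_normal(S)
Z_O = np.outer(z0, c)                # shape (S, N); row s is the trajectory of z0[s]

# Q: sample each z_i independently from its marginal q(z_i) = N(0, c_i^2).
Z_Q = rng.standard_normal((S, N)) * c

def sum_log_q(Z):
    # g(Z) = sum_i log q(z_i): depends on one index at a time.
    return np.sum(log_normal(Z, c), axis=1)

print(sum_log_q(Z_O).mean())  # expectation under O
print(sum_log_q(Z_Q).mean())  # expectation under Q -- matches up to Monte Carlo noise
```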


Once again, thanks a lot for your kind response. Best wishes!