Closed Kevin-Chen0 closed 3 years ago
Also, at the end of `coupled_kl_divergence_norm`, can you also add `kl_div.squeeze()`? That way, it will give an output of shape `(batch_size, )` rather than the `(batch_size, 1, 1)` I'm getting now.
The output of `coupled_kl_divergence_norm` suggests you are calculating 15 KL divergences, so your batch size is 15, correct? If not, make sure your inputs to the parameters of the `MultivariateCoupledNormal` are what you expect them to be. Why is the tensorflow output 1-D? Shouldn't it be (15, 128) under the assumption that there are 15 distributions in the batch and you used 128 samples?
Validate your inputs, and then use the analytic expression to find the KL divergences for each of your distributions in your batch. Compare the analytic results with your outputs.
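For concreteness, the analytic expression for the KL divergence between two univariate normals can be sketched as follows (pure Python, not the repo's code; the thread's distributions are multivariate, but a diagonal-covariance KL is just the sum of such per-dimension terms):

```python
import math

def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
    """Analytic KL(q || p) for two univariate normal distributions."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

# KL of N(0, 1) against itself is exactly zero
print(kl_normal(0.0, 1.0, 0.0, 1.0))  # → 0.0
# KL(N(1, 1) || N(0, 1)) works out to 0.5
print(kl_normal(1.0, 1.0, 0.0, 1.0))  # → 0.5
```

Evaluating this per distribution in the batch gives the reference values to compare the code's outputs against.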
The output size I am getting for `coupled_kl_divergence_norm` is (128, 1, 1). The batch size is 128. The output you see above is a cutoff of the outputs, so there are more than 15; I have added dots to clarify this. What I'm looking for is (128, ), and it is supposed to be 1-D. Converting from (128, 1, 1) can be fixed by simply adding `kl_div.squeeze()` in the `coupled_kl_divergence_norm` function. I will compare the analytic KL expression separately.
So, you are using 1 random sample to estimate the KL divergence for each distribution?
Edit: Added `.squeeze()` to the `coupled_cross_entropy_norm` function. This should squeeze all the outputs of the dependent functions to 1-D.
Thanks for adding squeeze(). You can push this to master if u haven't already.
KL Divergence (as currently written in the code) is the difference between the log PDF of the posterior `q(z|x)` and the log PDF of the prior `p(z)`. Therefore, the output is a scalar per batch, regardless of the dim of latent layer `z`.
```python
logpz = self.log_normal_pdf(z_sample, 0., 1.)
logqz_x = self.log_normal_pdf(z_sample, mean, logvar)
kl_div = logqz_x - logpz
```
In this bivariate example, z_sample's shape is TensorShape([128, 2]). Here in the loss function context, I'm using the terms sample and batch_size interchangeably. This is not necessarily the case for visualization/display where we sample the number below the batch size but not above. Output of logpz and logqz_x are both TensorShape([128]). So kl_div is also TensorShape([128]), a vector.
It is pushed to master, but I think you need to push it to PyPI before it will be added to the VAE code.
Yes, entropies, cross-entropies, and divergences are a single value regardless of the dimensionality of the underlying input.
What I am trying to say is this: You are estimating the KL divergence of the two distributions with a single sample per distribution. You are doing Monte Carlo approximation of the KL divergence with a single sample. This is done a fair amount, so it is ok to do, but you cannot compare the Monte Carlo approximation with 1 sample to the `coupled_kl_divergence` output.
`log_normal_pdf`'s output shape will be `(batch_size, number_of_samples)`, and therefore so will `kl_div`'s. `log_normal_pdf` gives the log probability density values evaluated at sampled random variables, given distribution parameters. It does not calculate the entropy/cross-entropy. To calculate the entropy of a distribution, or the cross-entropy of two distributions, you need to do summation (discrete) or integration (continuous) over the entire support of the distribution.
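To illustrate the distinction (a self-contained sketch, not the repo's `log_normal_pdf`): a single log-density value is not the entropy, but averaging many log-density values of samples drawn from the distribution itself is a Monte Carlo estimate of its negative entropy:

```python
import math
import random

def log_normal_pdf(x, mean, logvar):
    """Log density of a univariate normal, parameterized by log-variance."""
    return -0.5 * ((x - mean)**2 * math.exp(-logvar) + logvar + math.log(2 * math.pi))

random.seed(0)
mean, logvar = 0.0, 0.0  # standard normal: variance 1, logvar = log(1) = 0
samples = [random.gauss(mean, math.exp(logvar / 2)) for _ in range(100_000)]

# Monte Carlo entropy estimate: H = -E[log p(x)] with x ~ p
mc_entropy = -sum(log_normal_pdf(x, mean, logvar) for x in samples) / len(samples)
analytic_entropy = 0.5 * (math.log(2 * math.pi) + 1)  # closed form, ≈ 1.4189

print(mc_entropy, analytic_entropy)
```

The average over many samples approximates the integral over the support; any single `log_normal_pdf` value does not.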
If you were to use 100 samples per distribution, so `kl_div` is `(128, 100)`, and then average across the samples so `tf.reduce_mean(kl_div, axis=1)` is `(128, )`, you would have a better Monte Carlo estimate of the KL divergences to compare to `coupled_kl_divergence_norm`.
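A univariate pure-Python sketch of this idea (the helper and numbers are illustrative, not the repo's code) shows why the single-sample estimate is too noisy to compare against, while the many-sample average approaches the analytic value:

```python
import math
import random

def log_normal_pdf(x, mean, logvar):
    """Log density of a univariate normal, parameterized by log-variance."""
    return -0.5 * ((x - mean)**2 * math.exp(-logvar) + logvar + math.log(2 * math.pi))

def mc_kl(mean_q, logvar_q, mean_p, logvar_p, n_samples=100):
    """Monte Carlo KL(q || p): average of log q(z) - log p(z) over z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        z = random.gauss(mean_q, math.exp(logvar_q / 2))
        total += log_normal_pdf(z, mean_q, logvar_q) - log_normal_pdf(z, mean_p, logvar_p)
    return total / n_samples

random.seed(0)
# Analytic KL(N(1,1) || N(0,1)) = 0.5; the estimate tightens as n_samples grows
print(mc_kl(1.0, 0.0, 0.0, 0.0, n_samples=1))       # noisy single-sample estimate
print(mc_kl(1.0, 0.0, 0.0, 0.0, n_samples=10_000))  # close to 0.5
```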
Ok, I might have gotten mixed up when you said distribution. I thought u meant the latent layer distribution (which is 2) of the VAE, while u r actually referring to the output batch distribution.
What u r proposing here is what @kenricnelson has been alluding to: generating (or sampling) multiple output images per input image (from a batch). This is a new frontier of VAE construction, as I've only seen at most 1 output value per 1 input value, or 1 sample output from its specified batch in the latent vectors. However, your approach is probably doable, and it is good for `coupled_kl_divergence_norm` to have the flexibility to specify n samples. This is what param `n` will also be for, right?
Say that you have dist_p and dist_q, both with inputted loc of shape (128, 4), so batch_size is 128 and z_dim is 4. If you pass these distributions into `coupled_kl_divergence_norm` with n=1, then the output will be (128, ), right, assuming squeezed? So it will still be a vector. However, if u pass in n=100, will the output be (128, 100)? If so, then in the VAE, we can add another `tf.reduce_mean(kl_div)` to average out these samples and make it (128, ) again no problem, as u said.
So with more samples of the `coupled_kl_divergence` and then averaging them, would this approach be more comparable to the MC approx?
I'm not sure if it is a new frontier. If you have means and variances, then there is nothing stopping you from generating normal random variables to use in your MC estimation of the KL divergence.
The `n` parameter of `coupled_kl_divergence_norm` sets the number of samples to use in the MC estimate of the KL divergence (technically 1/2 of the samples, because there are 2 calls to the `coupled_cross_entropy_norm` function). It is lowered for speed and increased for accuracy/precision.
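The decomposition behind those two calls, KL(q || p) = H(q, p) - H(q, q), can be sketched in pure Python with Monte Carlo cross-entropy estimates (illustrative only; `mc_cross_entropy` is a hypothetical stand-in for `coupled_cross_entropy_norm`):

```python
import math
import random

def log_pdf(x, mean, sigma):
    """Log density of a univariate normal with standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mean)**2 / (2 * sigma**2)

def mc_cross_entropy(mean_q, sigma_q, mean_p, sigma_p, n=50_000):
    """MC estimate of the cross-entropy H(q, p) = -E_q[log p(z)], z ~ q."""
    return -sum(log_pdf(random.gauss(mean_q, sigma_q), mean_p, sigma_p)
                for _ in range(n)) / n

random.seed(0)
# KL(q || p) = H(q, p) - H(q, q): the two cross-entropy calls split the sample budget
kl = (mc_cross_entropy(1.0, 1.0, 0.0, 1.0)     # cross-entropy of q against p
      - mc_cross_entropy(1.0, 1.0, 1.0, 1.0))  # entropy of q (self cross-entropy)
print(kl)  # analytic value is 0.5
```

Each cross-entropy call consumes `n` samples, which is why the effective sample count per call is half the total.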
The only valid inputs into `coupled_kl_divergence_norm` are `MultivariateCoupledNormal` objects, whose number of dimensions can be any positive integer. An array cannot be a distribution, but may be samples from a distribution. `coupled_kl_divergence_norm` cannot take samples. `coupled_kl_divergence_norm` outputs will always have shape `(n_batches, )`.
@Kevin-Chen0 Do you have some example code of how to use ipdb?
EDIT: Nvm, I figured it out
So I've done a little bit of digging and here's what I've found.
`logpz` looks to be incorrect. If you want a standard normal distribution for the latent space, then the input for `logvar` should be `tf.math.log(1.)`, not simply `1.`; i.e. `logpz` should be calculated as `logpz = self.log_normal_pdf(z_sample, 0., tf.math.log(1.))`. When `kl_div` is estimated this way, the mean value (after applying `tf.math.reduce_mean`) I got from one iteration was around -0.0003, which in orders of magnitude seems to be consistent with the calculation using `coupled_kl_divergence_norm`. However, the sign is wrong.
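A minimal sketch of the parameterization issue, assuming the usual CVAE-style `log_normal_pdf` (log-variance parameterization, an assumption about the repo's helper): passing `1.` as `logvar` describes N(0, e), not N(0, 1):

```python
import math

def log_normal_pdf(z, mean, logvar):
    """CVAE-style helper: normal log density parameterized by log-variance."""
    return -0.5 * ((z - mean)**2 * math.exp(-logvar) + logvar + math.log(2 * math.pi))

# For a standard normal prior, variance = 1, so logvar = log(1) = 0
correct = log_normal_pdf(0.0, 0.0, math.log(1.0))  # logvar = 0
wrong   = log_normal_pdf(0.0, 0.0, 1.0)            # logvar = 1, i.e. variance = e

print(correct)  # -0.5 * log(2*pi) ≈ -0.9189, standard normal log density at 0
print(wrong)    # ≈ -1.4189, log density of N(0, e) at 0
```

Since `tf.math.log(1.)` is `0.`, the fix amounts to passing a log-variance of zero for the prior.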
I think I see where the issue is. You have put the posterior as:

`q_zx = MultivariateCoupledNormal(loc=mean.numpy(), scale=tf.exp(logvar/2).numpy())`

where you put in the exp of logvar/2 as the scale. However, in my original VAE code, I still have it as the following:
logqz_x = self.log_normal_pdf(z_sample, mean, logvar)
However, the following doesn't work atm:

`q_zx = MultivariateCoupledNormal(loc=mean.numpy(), scale=logvar.numpy())`

That is because of the AssertionError that we haven't taken out yet: `AssertionError: All scale values must be greater than 0.`
`mean` and `logvar` are outputs of the encoder. Since I didn't put any activation function as the final layer of the encoder, the outputs are raw and may contain negative values. However, this poses an issue when feeding into the MCN scale, as Sigma cannot contain negative values. This is an issue that we haven't come up with a resolution for yet.
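One common workaround (a sketch of options, not necessarily the resolution adopted here) is to pass the raw encoder output through a positivity-preserving map, such as `exp(logvar/2)` or softplus, before using it as a scale:

```python
import math

def softplus(x):
    """Smooth map from the real line to positive values: log(1 + e^x)."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)  # numerically stable form

# Raw encoder outputs may be negative; both maps yield valid positive scales
raw_outputs = [-2.0, -0.5, 0.0, 1.5]
scales_exp      = [math.exp(v / 2) for v in raw_outputs]  # treat raw value as logvar
scales_softplus = [softplus(v) for v in raw_outputs]

assert all(s > 0 for s in scales_exp + scales_softplus)
```

The exp map is what the `scale=tf.exp(logvar/2)` posterior above already does: interpreting the raw output as a log-variance makes any real value a legal scale.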
This is resolved, thanks @hxyue1 and @jkclem!
When I use tf.math to calculate the KL divergence:

I get the following numbers, averaging to 0.26684278. However, when I use `coupled_kl_divergence_norm` in the following manner:

I get the following numbers, averaging to just 0.0007430053. Plz advise.