Photrek / Nonlinear-Statistical-Coupling

coupled_kl_divergence_norm off by 2 to 3 orders of magnitude #45

Closed Kevin-Chen0 closed 3 years ago

Kevin-Chen0 commented 3 years ago

When I use tf.math to calculate the KL divergence:

        logpz = self.log_normal_pdf(z_sample, 0., 1.)
        logqz_x = self.log_normal_pdf(z_sample, mean, logvar)
        kl_div = logqz_x - logpz

I get the following numbers, averaging to 0.26684278.

ipdb> kl_div
<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.86921954,  0.45371056,  0.88609505, -0.01174855,  0.46242428,
        0.88558793,  0.9410727 , -0.3700683 ,  0.84306717, -0.06638861,
        0.37066197,  0.45233607,  0.8508425 ,  0.52803755,  0.9505639 ,
       -0.3642335 ,  0.38080645,  0.9970349 , -0.27189922,  0.97796655,
        0.95682573, -0.32304716, -0.813817  , -0.30250955,  0.6030083 ,
        0.75281763, -1.2111926 , -0.7194972 ,  0.2248416 , -0.5922799 ,
       -0.16742158,  0.05214858, -0.53073287,  0.548347  ,  0.6337991 ,
       -0.40753698,  0.864239  , -1.0780277 ,  0.7774732 ,  0.6771748 ,
        0.80476236, -0.46709728, -1.0554905 ,  0.37865567,  0.7497237 ,
        0.33856797,  0.81753445,  0.8892932 ,  0.3270316 , -1.6759243 ,
        0.4765191 ,  0.64577174,  0.25702858,  0.26793242,  0.8592057 ,
        0.7047727 ,  0.9932246 , -1.3861675 ,  0.10657287,  0.52103424,
        0.56670666,  0.63626647,  0.5903802 , -1.5752082 ,  0.23447895,
       -2.8028917 ,  0.61361504,  0.32030725,  0.77301764, -0.25954676,
       -0.19354391,  0.91773224,  0.4544549 ,  0.6440444 ,  0.9674704 ,
       -0.13501692, -0.4141333 , -0.14588952,  0.07112408,  0.96379423,
        0.96018887, -0.28566027,  0.45304155,  0.64666224,  0.5147927 ,
        0.460351  , -0.42211604,  0.88477373,  0.41102314,  0.663666  ,
        0.86534095,  0.9025917 ,  0.46783733,  0.47456598,  0.71588826,
        0.99136996,  0.08168316,  0.95838964,  0.9762056 , -0.6931119 ,
        0.39269042,  0.7297759 ,  0.70975566,  0.9976181 ,  0.1633246 ,
        0.8174796 ,  0.9928291 ,  0.6770606 ,  0.64429426, -0.60490847,
        0.63297606, -0.11669183,  0.78825736,  0.90766   , -0.7307825 ,
        0.9038205 ,  0.05003333,  0.89798975,  0.58409166, -0.593915  ,
        0.1855967 , -0.3870778 ,  0.96894336,  0.33893538, -0.7414057 ,
        0.6335244 , -2.4104114 ,  0.20711422], dtype=float32)>
ipdb> tf.math.reduce_mean(kl_div)
<tf.Tensor: shape=(), dtype=float32, numpy=0.26684278>
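
For context, log_normal_pdf here is the usual diagonal-Gaussian log-density helper (essentially the one from the TensorFlow CVAE tutorial); a minimal sketch of what it computes, for reference:

    import numpy as np
    import tensorflow as tf

    def log_normal_pdf(sample, mean, logvar, raxis=1):
        # Sketch only: log density of a diagonal Gaussian with the given
        # mean and log-variance, summed over the latent axis.
        log2pi = tf.math.log(2. * np.pi)
        return tf.reduce_sum(
            -0.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
            axis=raxis)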

However, when I use coupled_kl_divergence_norm in the following manner:

        x_recons_logits, z_sample, mean, logvar = self.model(x_true)

        # Approximate posterior q(z|x) and standard normal prior p(z)
        q_zx = MultivariateCoupledNormal(loc=mean.numpy(), scale=tf.exp(logvar/2).numpy())
        p_z = MultivariateCoupledNormal(loc=np.zeros(mean.shape), scale=np.ones(logvar.shape))

        kl_div = coupled_kl_divergence_norm(q_zx, p_z, root=False)
        kl_div = tf.convert_to_tensor(kl_div, dtype=tf.float32)

I get the following numbers, averaging to just 0.0007430053.

...
...
       [[ 1.5007547e-03]],

       [[ 1.4787790e-04]],

       [[ 4.7172891e-04]],

       [[ 1.7600998e-03]],

       [[-3.5504821e-05]],

       [[ 2.9015096e-04]],

       [[ 1.0139990e-03]],

       [[ 2.1769719e-03]],

       [[ 3.9246224e-04]],

       [[ 3.0478922e-04]],

       [[ 2.3205217e-03]],

       [[ 1.1474675e-03]],

       [[ 2.5465735e-04]],

       [[ 1.3815672e-03]],

       [[ 1.0416418e-03]]], dtype=float32)>
ipdb> tf.math.reduce_mean(kl_div)
<tf.Tensor: shape=(), dtype=float32, numpy=0.0007430053>

Please advise.

Kevin-Chen0 commented 3 years ago

Also, at the end of coupled_kl_divergence_norm, can you also add kl_div.squeeze()? That way it will give an output of shape (batch_size,) rather than the (batch_size, 1, 1) I'm getting now.

jkclem commented 3 years ago

The output of coupled_kl_divergence_norm suggests you are calculating 15 KL divergences, so your batch size is 15, correct? If not, make sure the inputs to the parameters of MultivariateCoupledNormal are what you expect them to be. Why is the TensorFlow output 1-D? Shouldn't it be (15, 128), under the assumption that there are 15 distributions in the batch and you used 128 samples?

Validate your inputs, and then use the analytic expression to find the KL divergence for each distribution in your batch. Compare the analytic results with your outputs.
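
For a diagonal Gaussian q = N(mean, diag(exp(logvar))) against a standard normal prior, the analytic KL has a simple closed form. A minimal sketch (my own illustration, reusing your mean and logvar names):

    import tensorflow as tf

    def analytic_kl_to_std_normal(mean, logvar):
        # Closed-form KL( N(mean, diag(exp(logvar))) || N(0, I) ),
        # one value per batch element; output shape (batch_size,).
        return 0.5 * tf.reduce_sum(
            tf.exp(logvar) + tf.square(mean) - 1. - logvar, axis=1)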

Kevin-Chen0 commented 3 years ago

The output shape I am getting from coupled_kl_divergence_norm is (128, 1, 1); the batch size is 128. The output you see above is truncated, so there are more than 15 entries. I have added dots to clarify this.

What I'm looking for is (128,), which is supposed to be 1-D. Converting from (128, 1, 1) can be fixed by simply adding kl_div.squeeze() in the coupled_kl_divergence_norm function.

I will compare against the analytic KL expression separately.

jkclem commented 3 years ago

So, you are using 1 random sample to estimate the KL divergence for each distribution?

Edit: Added .squeeze() to the coupled_cross_entropy_norm function. This should squeeze the outputs of all dependent functions to 1-D.

Kevin-Chen0 commented 3 years ago

Thanks for adding squeeze(). You can push this to master if you haven't already.

KL divergence (as currently written in the code) is the difference between the log PDF of the posterior q(z|x) and the log PDF of the prior p(z). Therefore, the output is a scalar per batch element, regardless of the dimensionality of the latent layer z.

        logpz = self.log_normal_pdf(z_sample, 0., 1.)
        logqz_x = self.log_normal_pdf(z_sample, mean, logvar)
        kl_div = logqz_x - logpz

In this bivariate example, z_sample's shape is TensorShape([128, 2]). Here, in the loss-function context, I'm using the terms sample and batch_size interchangeably. This is not necessarily the case for visualization/display, where we may sample a number at or below the batch size but not above it. The outputs logpz and logqz_x both have TensorShape([128]), so kl_div also has TensorShape([128]), a vector.

jkclem commented 3 years ago

It is pushed to master, but I think you need to push it to PyPI before it will be added to the VAE code.

Yes, entropies, cross-entropies, and divergences are a single value regardless of the dimensionality of the underlying input.

What I am trying to say is this: you are estimating the KL divergence of the two distributions with a single sample per distribution, i.e., a Monte Carlo approximation of the KL divergence with one sample. This is done fairly often, so it is fine, but you cannot expect a one-sample Monte Carlo approximation to match the coupled_kl_divergence output.

log_normal_pdf's output shape will be (batch_size, number_of_samples), and therefore so will kl_div's. log_normal_pdf gives the log probability density values evaluated at sampled random variables, given the distribution parameters; it does not calculate the entropy/cross-entropy. To calculate the entropy of a distribution, or the cross-entropy of two distributions, you need to sum (discrete) or integrate (continuous) over the entire support of the distribution.

If you were to use 100 samples per distribution and then average over the samples for each distribution, so that kl_div is (128, 100) and tf.reduce_mean(kl_div, axis=1) is (128,), you would have a better Monte Carlo estimate of the KL divergences to compare to coupled_kl_divergence_norm.
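
A minimal sketch of what I mean (assuming eager mode, mean and logvar of shape (batch_size, latent_dim), and the log_normal_pdf helper from above):

    import tensorflow as tf

    n_samples = 100
    batch_size, latent_dim = mean.shape

    # Draw n_samples per distribution: shape (batch_size, n_samples, latent_dim).
    eps = tf.random.normal(shape=(batch_size, n_samples, latent_dim))
    z = mean[:, None, :] + tf.exp(logvar / 2)[:, None, :] * eps

    # log q(z|x) - log p(z) per sample, reduced over the latent axis (raxis=2).
    logqz_x = log_normal_pdf(z, mean[:, None, :], logvar[:, None, :], raxis=2)
    logpz = log_normal_pdf(z, 0., 0., raxis=2)  # logvar = 0. means unit variance

    kl_mc = tf.reduce_mean(logqz_x - logpz, axis=1)  # shape (batch_size,)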

Kevin-Chen0 commented 3 years ago

Ok, I might have gotten mixed up when you said distribution. I thought you meant the latent-layer distribution of the VAE (whose dimension is 2), while you are actually referring to the distributions in the output batch.

What you are proposing here is what @kenricnelson has been alluding to: generating (or sampling) multiple outputs per input image in a batch. This is a new frontier of VAE construction to me, as I've only seen at most one output value per input value, or one sampled output from the latent vectors per batch element. However, your approach is probably doable, and it is good for coupled_kl_divergence_norm to have the flexibility to specify n samples. Is that what the n parameter is also for?

Say you have dist_p and dist_q, both constructed with a loc of shape (128, 4), so batch_size is 128 and z_dim is 4. If you pass these distributions into coupled_kl_divergence_norm with n=1, the output will be (128,), right, assuming it is squeezed? So it will still be a vector. However, if you pass in n=100, will the output be (128, 100)? If so, then in the VAE we can add another tf.reduce_mean(kl_div) to average over these samples and make it (128,) again, no problem, as you said.

So with more samples of coupled_kl_divergence and then averaging them, would this approach be more comparable to the MC approximation?

jkclem commented 3 years ago

I'm not sure if it is a new frontier. If you have means and variances, then there is nothing stopping you from generating normal random variables to use in your MC estimation of the KL divergence.

The n parameter of coupled_kl_divergence_norm sets the number of samples to use in the MC estimate of the KL divergence (technically half the samples, because there are two calls to the coupled_cross_entropy_norm function). Lower it for speed, raise it for accuracy/precision.

The only valid inputs to coupled_kl_divergence_norm are MultivariateCoupledNormal objects, whose number of dimensions can be any positive integer. An array is not a distribution; it may contain samples from a distribution, but coupled_kl_divergence_norm cannot take samples. Its outputs will always have shape (n_batches,).
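
Roughly how I'd expect it to be used, as a sketch only (the import paths and the n keyword should be checked against the repo; the example shapes are made up):

    import numpy as np
    # Sketch only; assumes MultivariateCoupledNormal and
    # coupled_kl_divergence_norm are imported from this package.

    batch_size, z_dim = 128, 4
    q_zx = MultivariateCoupledNormal(loc=np.random.randn(batch_size, z_dim),
                                     scale=np.ones((batch_size, z_dim)))
    p_z = MultivariateCoupledNormal(loc=np.zeros((batch_size, z_dim)),
                                    scale=np.ones((batch_size, z_dim)))

    # n sets how many MC samples are used internally; output has shape (128,).
    kl = coupled_kl_divergence_norm(q_zx, p_z, root=False, n=100)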

hxyue1 commented 3 years ago

@Kevin-Chen0 Do you have some example code of how to use ipdb?

EDIT: Never mind, I figured it out.

hxyue1 commented 3 years ago

So I've done a little bit of digging and here's what I've found.

  1. The calculation of logpz looks to be incorrect. If you want a standard normal distribution for the latent space, then the input for logvar should be tf.math.log(1.), not simply 1.; i.e., logpz should be calculated as logpz = self.log_normal_pdf(z_sample, 0., tf.math.log(1.)) (see the quick check after this list).

When kl_div is estimated this way, the mean value (after applying tf.math.reduce_mean) I got from one iteration was around -0.0003, which in order of magnitude seems consistent with the calculation using coupled_kl_divergence_norm. However, the sign is wrong.

  2. As John pointed out, the KL divergence calculated the first way is a Monte Carlo estimator, since we are only using one realization per distribution (we have 128 two-dimensional distributions we are sampling from). Since the sample size per distribution is so low, this inevitably adds noise to our estimate. This isn't inherently bad, but I do think the variability actually biases the estimator somewhat.
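
A quick numerical check of the logvar point (sketch, just to show the magnitudes):

    import tensorflow as tf

    # Passing logvar = 1. implies a prior variance of exp(1.) ~= 2.718,
    # not the unit variance of a standard normal.
    print(tf.exp(1.).numpy())               # ~2.7182817
    # tf.math.log(1.) = 0., which gives the intended unit variance.
    print(tf.exp(tf.math.log(1.)).numpy())  # 1.0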

Kevin-Chen0 commented 3 years ago

I think I see where the issue is. You have set the posterior as:

q_zx = MultivariateCoupledNormal(loc=mean.numpy(), scale=tf.exp(logvar/2).numpy())

Here you pass in exp(logvar/2) as the scale. However, in my original VAE code, I still have it as the following:

logqz_x = self.log_normal_pdf(z_sample, mean, logvar)

However, the following doesn't work at the moment:

q_zx = MultivariateCoupledNormal(loc=mean.numpy(), scale=logvar.numpy())

That is because of the AssertionError check that we haven't taken out yet:

AssertionError: All scale values must be greater than 0..

mean and logvar are outputs of the encoder. Since I didn't put any activation function on the final layer of the encoder, the outputs are raw and may contain negative values. However, this poses an issue when feeding them into the MultivariateCoupledNormal scale, as Sigma cannot contain negative values. This is an issue that we haven't come up with a resolution for yet.
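
For reference, the interpretation used above treats the raw encoder output as a log-variance and maps it to a strictly positive scale before constructing the distribution; a minimal sketch of that mapping (and a common softplus alternative):

    import numpy as np
    import tensorflow as tf

    # logvar is the raw (possibly negative) encoder output.
    scale = tf.exp(logvar / 2).numpy()            # sigma = exp(logvar / 2) > 0
    # scale = tf.math.softplus(logvar).numpy()    # another common positive mapping

    assert np.all(scale > 0)  # satisfies the MultivariateCoupledNormal check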

Kevin-Chen0 commented 3 years ago

This is resolved, thanks @hxyue1 and @jkclem!