ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0
27.8k stars 12.74k forks

[QUESTION] KL divergence formula for the regularizer layer needs explication #561

Open hansglick opened 2 years ago

hansglick commented 2 years ago

Hi @ageron ,

In cell 44 of https://github.com/ageron/handson-ml2/blob/master/17_autoencoders_and_gans.ipynb, you build a KLDivergence layer, but the formula you use is a little difficult to understand, at least for me.

Why

kl_divergence(self.target, mean_activities) +
kl_divergence(1. - self.target, 1. - mean_activities)

?

and not simply kl_divergence(self.target, mean_activities) ?

ageron commented 2 years ago

Hi @hansglick ,

That's a great question, thanks!

The KL divergence equation computes the divergence between two probability distributions (see my video on this topic). For example, if the probability of activation is 0.4 but we actually want it to be 0.1 (for sparsity), then the correct equation is:

>>> import numpy as np
>>> 0.1 * np.log(0.1 / 0.4) + (1 - 0.1) * np.log((1 - 0.1) / (1 - 0.4))
0.22628916118535888

This includes the probability of activation (0.4) and the probability of no-activation (1-0.4), since we need a full probability distribution.
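To see concretely why the second term matters, here is a small NumPy sketch (the function name `bernoulli_kl` is my own, not from the notebook) comparing the full Bernoulli KL divergence with the single activation term the question asks about:

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL divergence between two Bernoulli distributions with
    activation probabilities p (target) and q (actual)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

target, actual = 0.1, 0.4
full = bernoulli_kl(target, actual)           # both terms
partial = target * np.log(target / actual)    # activation term only

print(full)     # ≈ 0.2263, matches the value above
print(partial)  # ≈ -0.1386, negative!
```

Note that the single-term version can be negative, so on its own it is not a valid divergence (KL is always >= 0) and could even *reward* the wrong activation level instead of penalizing it.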

Or we can use the kullback_leibler_divergence() function from the tensorflow.keras.losses module to get the same result as a tensor:

>>> from tensorflow.keras.losses import kullback_leibler_divergence
>>> kullback_leibler_divergence([0.1, 1-0.1], [0.4, 1-0.4])
<tf.Tensor: shape=(), dtype=float32, numpy=0.2262891>

Another way to get the same result is to call kullback_leibler_divergence() twice, once with just the probability of activation and once with just the probability of no-activation:

>>> kullback_leibler_divergence([0.1], [0.4]) + kullback_leibler_divergence([1-0.1], [1-0.4])
<tf.Tensor: shape=(), dtype=float32, numpy=0.2262891>

In the notebook, this last option is less verbose, since it does not require concatenating the target and the mean activities into a single tensor.
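Putting the pieces together, the sparsity regularizer can be sketched in plain NumPy as follows. This is a simplified illustration, not the notebook's verbatim Keras code; the names `sparsity_loss`, `TARGET`, and `WEIGHT` (and the weight value) are assumptions for the example:

```python
import numpy as np

TARGET = 0.1   # desired mean activation probability (sparsity target)
WEIGHT = 0.05  # regularization strength (hypothetical value)

def kl_divergence(p, q):
    # Element-wise KL term, mirroring the notebook's two-call pattern.
    return p * np.log(p / q)

def sparsity_loss(activations):
    """Average each neuron's activation over the batch, then sum
    the two KL terms (activation and no-activation)."""
    mean_activities = activations.mean(axis=0)
    return WEIGHT * np.sum(
        kl_divergence(TARGET, mean_activities)
        + kl_divergence(1.0 - TARGET, 1.0 - mean_activities))

# Example: a batch of sigmoid activations for 3 coding units.
batch = np.array([[0.4, 0.1, 0.2],
                  [0.4, 0.1, 0.2]])
print(sparsity_loss(batch))  # small positive penalty
```

A neuron whose mean activation already equals the target contributes exactly zero, while neurons that are too active (or too inactive) add a positive penalty, which is what pushes the coding layer toward sparsity during training.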

I agree it's not intuitive, so I think I'll add a note about this in the notebook. Thanks again!

hansglick commented 2 years ago

@ageron Thank you Sir for your great explanation and your time.