ayhem18 closed this issue 7 months ago
Regarding (1), the simplest is $h_i = F(x_i)+n_i$, but that is really for when you have to generate samples. Before sampling, you already know that the samples would be drawn from a mixture of Gaussians with centers at $F(x_i)$. For bounding MI, you don't need to draw random samples: you use $\|F(x_i)-F(x_j)\|$ (distances between centers) inside the exponential, not $\|h_i-h_j\|$ (distances between samples). Hope that makes sense.
Regarding (2), yes, I think something like that should work.
Good luck!
Thank you for such a prompt response.
So my main understanding of the KDE estimation of MI(T;X) is as follows: T is a deterministic function of X, so MI(T;X) is infinite. However, if we consider $M = T(X) + Z$ with $Z \sim N(0, \sigma^2 \cdot I)$, then MI(M;X) is finite, and M can be seen as a mixture of N Gaussian components, each centered at $T(x_i)$. Using the results from your previous work, we can then bound MI(M;X) from above:
$$ -\frac{1}{N} \sum_{i} \log \left( \frac{1}{N} \sum_{j} \exp\left( -\frac{1}{2\sigma^2} \, \| T(x_i) - T(x_j) \|^2 \right) \right) $$
Here I don't need to add noise to $T(x_i)$, since the formula uses the centers of the Gaussian mixture components and not samples, right?
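For concreteness, here is a minimal NumPy sketch of how I am computing this bound (the function and variable names are my own; `centers` holds the rows $T(x_i)$ and `sigma` is the noise standard deviation):

```python
import numpy as np
from scipy.special import logsumexp

def kde_mi_upper_bound(centers, sigma):
    """Pairwise-distance upper bound on MI(M; X) with M = T(X) + Z.

    centers : (N, d) array whose rows are the mixture centers T(x_i).
    sigma   : standard deviation of the noise Z ~ N(0, sigma^2 I).
    Returns the bound in nats.
    """
    n = centers.shape[0]
    # Squared pairwise distances ||T(x_i) - T(x_j)||^2, shape (N, N).
    diffs = centers[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Inner term: log( (1/N) * sum_j exp(-d_ij / (2 sigma^2)) ),
    # computed with logsumexp for numerical stability.
    log_inner = logsumexp(-sq_dists / (2 * sigma ** 2), axis=1) - np.log(n)
    return -np.mean(log_inner)
```

For example, stacking the layer activations $T(x_i)$ for the whole training set into `centers` and calling `kde_mi_upper_bound(centers, 0.1)` would give the bound for that layer at noise level $\sigma = 0.1$.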
I am completely new to this line of research, so I admit that my understanding is quite superficial, but it should be enough for my immediate objectives.
Thanks a lot in advance.
(1) Almost. If X is really just a mixture of N delta functions (e.g., N data points), then MI(T;X) is not infinite but bounded by log N. But usually X is continuous and N is just the number of samples we have; the true value of MI(T;X) is then infinite (as approached in the $N \to \infty$ limit), while MI(T+Z;X) is well-behaved and finite. We can see the KDE estimator, when fed with N samples, as a way to estimate the true value of MI(T+Z;X).
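To spell out where the log N comes from: mutual information can never exceed the entropy of either argument, so if X is supported on N equally weighted points,

$$ \mathrm{MI}(T;X) \le H(X) = \log N . $$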
If you wish, you can also imagine that the input X is itself a mixture of Gaussians, in which case the MI estimator should use a slightly modified covariance matrix. However, in my experience, this doesn't make a big difference (see Section V in the arXiv version of my paper, https://arxiv.org/pdf/1706.02419.pdf; unfortunately the published version has some typos in that section).
(2) The standard textbook for information theory is Cover & Thomas; it is really good. The best mathematical treatment of the IB is https://www.cs.huji.ac.il/labs/learning/Papers/ib_theory.pdf
First of all, thank you for sharing your great work.
I am a senior CS bachelor student, and I am using the KDE MI estimator suggested in your work as part of the theoretical framework supporting the empirical results I have reached so far in my thesis. I have a couple of questions concerning the implementation.
I am not exactly sure I understand this part of the paper. It is clear that I should pass each $x_i$ through the network to obtain $F(x_i)$ and generate some random noise $n_i \sim N(0, \sigma^2 \cdot I)$. However, it is not clear to me whether the final layer state should be $h_i = F(x_i) + n_i$ or $h_i = F(x_i) + F(n_i)$, so that MI can be estimated as:
$$ -\frac{1}{N} \sum_{i} \log \left( \frac{1}{N} \sum_{j} \exp\left( -\frac{1}{2\sigma^2} \, \| h_i - h_j \|^2 \right) \right) $$
and, for the class-conditional version:

$$ -\frac{1}{N} \sum_{c \in C} \sum_{i : y_i = c} \log \left( \frac{1}{N_c} \sum_{j : y_j = c} \exp\left( -\frac{1}{2\sigma^2} \, \| h_i - h_j \|^2 \right) \right) $$
where $C$ is the set of classes, $N_c$ is the number of samples belonging to class $c$, and $\sum_{j : y_j = c}$ iterates over the samples of class $c$.
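In case it helps to be concrete, here is a self-contained sketch of how I would compute this class-conditional term (names are my own; `h` holds the layer states $h_i$ and `labels` the $y_i$):

```python
import numpy as np
from scipy.special import logsumexp

def class_conditional_term(h, labels, sigma):
    """Class-conditional pairwise term.

    h      : (N, d) array of layer states h_i.
    labels : (N,) array of class labels y_i.
    sigma  : standard deviation of the added Gaussian noise.
    """
    n = h.shape[0]
    total = 0.0
    for c in np.unique(labels):
        hc = h[labels == c]           # the h_i with y_i = c
        nc = hc.shape[0]              # N_c
        # Squared pairwise distances within class c.
        diffs = hc[:, None, :] - hc[None, :, :]
        sq_dists = np.sum(diffs ** 2, axis=-1)
        # log( (1/N_c) * sum_{j : y_j = c} exp(-d_ij / (2 sigma^2)) )
        inner = logsumexp(-sq_dists / (2 * sigma ** 2), axis=1) - np.log(nc)
        total += np.sum(inner)
    return -total / n
```

If I understand the construction correctly, subtracting this term from the first (unconditional) expression would then give an estimate of MI(T;Y), since the noise-entropy contributions cancel.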
Thanks a lot in advance.