ayhem18 closed this issue 7 months ago
Regarding (1), the simplest is $h_i = F(x_i)+n_i$, but that is really for when you have to generate samples. Before sampling, you already know that the samples would be drawn from a mixture of Gaussians with centers at $F(x_i)$. For bounding MI, you don't need to draw random samples: you use $\|F(x_i)-F(x_j)\|$ (distances between centers) inside the exponential, not $\|h_i-h_j\|$ (distances between samples). Hope that makes sense.
Regarding (2), yes, I think something like that should work.
Good luck!
Thank you for such a prompt response.
So my main understanding of the KDE estimation of MI(T;X) is as follows: T is a deterministic function of X, so MI(T;X) is infinite. However, if we consider $M = T(X) + Z$ with $Z \sim N(0, \sigma^2 \cdot I)$, then MI(M;X) is finite, and M can be seen as a mixture of N Gaussian components, each centered at $T(x_i)$. Using the results from your previous work, we can then bound MI(M;X) from above:
$$ -\frac{1}{N} \sum_{i} \log \left( \frac{1}{N} \sum_{j} \exp\left( -\frac{1}{2\sigma^2} \, \| T(x_i) - T(x_j) \|^2 \right) \right) $$
Here I don't need to add noise to $T(x_i)$, since the formula uses the centers of the Gaussian mixture components and not samples, right?
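For concreteness, here is a minimal NumPy sketch of how I am computing this bound (the function and variable names are my own; `centers` holds the rows $T(x_i)$ and `sigma` is the noise standard deviation):

```python
import numpy as np
from scipy.special import logsumexp

def kde_mi_upper_bound(centers, sigma):
    """Pairwise-distance upper bound on MI(M; X) with M = T(X) + Z.

    centers : (N, d) array whose rows are the mixture centers T(x_i).
    sigma   : standard deviation of the noise Z ~ N(0, sigma^2 I).
    Returns the bound in nats.
    """
    n = centers.shape[0]
    # Squared pairwise distances ||T(x_i) - T(x_j)||^2, shape (N, N).
    diffs = centers[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Inner term: log( (1/N) * sum_j exp(-d_ij / (2 sigma^2)) ),
    # computed with logsumexp for numerical stability.
    log_inner = logsumexp(-sq_dists / (2 * sigma ** 2), axis=1) - np.log(n)
    return -np.mean(log_inner)
```

For example, stacking the layer activations $T(x_i)$ for the whole training set into `centers` and calling `kde_mi_upper_bound(centers, 0.1)` would give the bound for that layer at noise level $\sigma = 0.1$.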
I am completely new to this line of research, so I admit that my understanding is quite superficial, but it should be enough for my immediate objectives.
Thanks a lot in advance.
(1) Almost. If X is really just a mixture of N delta functions (e.g., N data points), then MI(T;X) is not infinite but bounded by log N. But usually X is continuous and N is just the number of samples we have; the true value of MI(T;X) is then infinite (as approached in the $N \to \infty$ limit), while MI(T+Z;X) is well-behaved and finite. We can see the KDE estimator, when fed with N samples, as a way to estimate the true value of MI(T+Z;X).
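To spell out where the log N comes from: mutual information can never exceed the entropy of either argument, so if X is supported on N equally weighted points,

$$ \mathrm{MI}(T;X) \le H(X) = \log N . $$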
If you wish, you can also imagine that the input X is itself a mixture of Gaussians, in which case the MI estimator should use a slightly modified covariance matrix. However, in my experience, this doesn't make a big difference (see Section V in the arXiv version of my paper, https://arxiv.org/pdf/1706.02419.pdf; unfortunately the published version has some typos in that section).
(2) The standard textbook for information theory is Cover & Thomas; it is really good. The best mathematical treatment of the IB is https://www.cs.huji.ac.il/labs/learning/Papers/ib_theory.pdf
First of all, thank you for sharing your great work.
I am a senior CS bachelor student, and I am using the KDE MI estimator suggested in your work as part of the theoretical framework supporting the empirical results I have reached so far in my thesis. I have a couple of questions concerning the implementation.
I am not exactly sure I understand this part of the paper. It is clear that I should pass each $x_i$ through the network to obtain $F(x_i)$ and generate some random noise $n_i \sim N(0, \sigma^2 \cdot I)$. However, it is not clear to me whether the final layer state should be $h_i = F(x_i) + n_i$ or $h_i = F(x_i) + F(n_i)$, so that MI can be estimated as:
$$ -\frac{1}{N} \sum_{i} \log \left( \frac{1}{N} \sum_{j} \exp\left( -\frac{1}{2\sigma^2} \, \| h_i - h_j \|^2 \right) \right) $$
and, for the class-conditional version:

$$ -\frac{1}{N} \sum_{c \in C} \sum_{i : y_i = c} \log \left( \frac{1}{N_c} \sum_{j : y_j = c} \exp\left( -\frac{1}{2\sigma^2} \, \| h_i - h_j \|^2 \right) \right) $$
where $C$ is the set of classes, $N_c$ is the number of samples belonging to class $c$, and $\sum_{j : y_j = c}$ iterates over the samples of class $c$.
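In case it helps to be concrete, here is a self-contained sketch of how I would compute this class-conditional term (names are my own; `h` holds the layer states $h_i$ and `labels` the $y_i$):

```python
import numpy as np
from scipy.special import logsumexp

def class_conditional_term(h, labels, sigma):
    """Class-conditional pairwise term.

    h      : (N, d) array of layer states h_i.
    labels : (N,) array of class labels y_i.
    sigma  : standard deviation of the added Gaussian noise.
    """
    n = h.shape[0]
    total = 0.0
    for c in np.unique(labels):
        hc = h[labels == c]           # the h_i with y_i = c
        nc = hc.shape[0]              # N_c
        # Squared pairwise distances within class c.
        diffs = hc[:, None, :] - hc[None, :, :]
        sq_dists = np.sum(diffs ** 2, axis=-1)
        # log( (1/N_c) * sum_{j : y_j = c} exp(-d_ij / (2 sigma^2)) )
        inner = logsumexp(-sq_dists / (2 * sigma ** 2), axis=1) - np.log(nc)
        total += np.sum(inner)
    return -total / n
```

If I understand the construction correctly, subtracting this term from the first (unconditional) expression would then give an estimate of MI(T;Y), since the noise-entropy contributions cancel.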
Thanks a lot in advance.