Computation of h(w,n), and issues of your paper

pengkui commented 11 years ago

It is not a bug of your code, but I just wanted to point out that the definition of $h(w,n) = -w log_2 (w/n)$ in your arxiv paper (http://arxiv.org/pdf/1110.2515.pdf) does not match your C++ code (onmi.cpp: 178-190), with your code being the correct one. In addition, there is no mentioning about the computation of H(X) in your paper. You may want to correct your paper. - Pengkui

aaronmcdaid commented 11 years ago

Hi, I agree with you that the code is correct. But I believe it is the same as on the arxiv paper - I cannot see any discrepancy.

double H(const int x, const int N) {
        if(x==0)
                return 0.0;
        const double Px = double(x) / double(N);
        assert(x>0 && Px > 0.0);
        return -x * log2(Px);
}

This code effectively means return -x * log2( x / N). This is the same as the latex expression in the paper, with x instead of w. $h(w,n) = -w log_2 (w/n)$ . Perhaps you could clarify further if you believe there is still a problem?

Thanks for your feedback.

aaronmcdaid commented 11 years ago

I agree with your other point, that I forgot to define H(X) in my paper. Thanks for that. My arXiv paper defines H(X|Y) but it does not define H(X). Hopefully, I will be able to expand on that sometime.

pengkui commented 11 years ago

For the first part, I meant to say that $h(w,n) = -w log_2 {w/n}$ in your paper should be corrected to $h(w,n) = -w/n log_2 {w/n}$. The latter is the function double h (const double p) you use in your function H_X_given_Y() to compute h(a,n) etc.

aaronmcdaid commented 11 years ago

I defined two functions, H(const int x, const int N) and h(const double p), one after the other in the file. I should have used different names to distinguish them clearly! I think we are agreed that the code implemented is correct. The question is about the correctness of the maths in the arXiv document. I will agree that there is an apparent difference between the maths for equation (4) and the code that implements it, but I believe I can show they are equivalent. But first, we must look at equation (1,2,3).

I believe that $h(w,n) = -w log_2 {w/n}$ is correct. The number of bits taken to encode a single event is $log_2{p}$, where $ p $ is the probability of the event. In this case, $ p = w/n $ as there are $ w $ nodes that are in (or are not in) the relevant communities, out of a total of $ n $ nodes.

In this problem, there are four possible states, $ a+b+c+d = n$. The average number of bits to encode the state of one node would then be the weighted average of the various $log_2{a} , log_2{b}, log_2{c}, log_2{d}$. The exact answer for this weighted average is then $ a/n log_2{a/n} + b/n log_2{b/n} + c/n log_2{c/n} + d/n log_2{d/n} $.

That is the average number of bits to encode one event. But there are $ n $ nodes and we need to encode the total number of bits to encode all the information in the $ n $ events. This requires multiplying by $ n $. This leads to the form with $ a log_2{a/n}$.

In short, the total number of bits is calculated by summing $ x log_2{x/n} $, where $ x $ is the absolute number of the events of a particular type, and $ x/n $ is the proportion of all nodes that have type.

That describes the formula for equations (1,2,3) in the paper for H(X_i|Y_j), we are calculating the total number of bits to encode all the information across all nodes, not just the average number of bits to encode the state of one node.

Finally, however, there is an apparent contradiction between my equation (4) in the paper and the particular implementation I used in my software. In my paper, equation (4) asks us to compare $ a log_2{a/n} + d log_2{d/n} $ to $ b log_2{b/n} + c log_2{c/n} $ and to make a decision based on which is larger. In my code I have instead compared $ a/n log_2{a/n} + d/n log_2{d/n} $ to $ b/n log_2{b/n} + c/n log_2{c/n} $.

The interesting thing is that these two comparisons are effectively identical! By multiplying, or dividing, both sides of a comparison the same number, $n$, the result of the comparison will be unchanged. Therefore, it is safe for me to use a slightly different expression - the result can be proven to be the same.

I apologise that I have written such a long reply. I think there are essentially two distinct issues:

Is the equation for H(X_i|Y_j) correct? In particular, does it use $h(w,n) = -w log_2 {w/n}$ in an appropriate way to calculate the total number of bits correctly?
Is equation (4) correct, the equation that identifies whether or not a particular pair of clusterings are negatively correlated?

If there are further issues, then perhaps it might be useful to identify what we agree on as well as identify what we might disagree on.

Thanks again, Aaron

aaronmcdaid commented 11 years ago

I think this issue can be closed now. Unless there are further questions? I have submitted an updated document to arXiv that defines H(X) clearly, it may take a few days before it is visible.

aaronmcdaid / Overlapping-NMI

Computation of h(w,n), and issues of your paper #1