FilippoMB / Spectral-Clustering-with-Graph-Neural-Networks-for-Graph-Pooling

Experimental results obtained with the MinCutPool layer as presented in the 2020 ICML paper "Spectral Clustering with Graph Neural Networks for Graph Pooling"
https://arxiv.org/abs/1907.00481
MIT License

missing definition of \widetilde{D} in Equation (6) #6

Closed · longbowzhang closed this issue 3 years ago

longbowzhang commented 3 years ago

Hi @FilippoMB, @danielegrattarola, I have a couple of questions regarding Equation (6).

  1. What is the definition of \widetilde{D}? If it is normalized in the same way as the adjacency matrix, it is thus an identity matrix, isn't it?
  2. In Equation (3) you use the adjacency matrix, but in Equation (6) you use the normalized adjacency matrix. Why?

Thank you very much in advance.

FilippoMB commented 3 years ago

Hi @longbowzhang !

\tilde{D} in Eq. 6 is the degree matrix of \tilde{A}; you can compute it, for example, by summing the columns of \tilde{A}. \tilde{D} is a diagonal matrix but, in general, it is different from the identity matrix because \tilde{A} = D^{-1/2} A D^{-1/2}. On the other hand, if we had defined \tilde{A} = D^{-1} A, then \tilde{D} would indeed have been the identity matrix.
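For illustration, here is a minimal NumPy sketch of this computation; the toy adjacency matrix and all variable names are made up for the example:

```python
import numpy as np

# toy symmetric adjacency matrix (made up for illustration)
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])

deg = A.sum(axis=1)                       # node degrees
D = np.diag(deg)                          # degree matrix of A
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

A_sym = D_inv_sqrt @ A @ D_inv_sqrt       # \tilde{A} = D^{-1/2} A D^{-1/2}
D_sym = np.diag(A_sym.sum(axis=1))        # \tilde{D}: degree matrix of \tilde{A}
print(np.diag(D_sym))                     # diagonal, but not all ones

A_rw = np.diag(1.0 / deg) @ A             # D^{-1} A (random-walk normalization)
D_rw = np.diag(A_rw.sum(axis=1))          # row sums are all 1
print(np.diag(D_rw))                      # [1. 1. 1. 1.]
```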

The reason why the loss in Eq. 6 is defined using \tilde{A} and \tilde{D}, rather than A and D as in Eq. 3, is purely for convenience. The message-passing layers that usually come before and after the pooling layer take \tilde{A} as input. Since tr(S^T A S)/tr(S^T D S) = tr(S^T \tilde{A} S)/tr(S^T \tilde{D} S), it is more convenient to use the latter formulation, so that both the message-passing and the pooling layers can be fed the same matrix, rather than carrying around both \tilde{A} and A.

If you look at the code, this convenience should be clear: we compute \tilde{A} once at the beginning and then never need A again.

Cheers :)

longbowzhang commented 3 years ago

Hi @FilippoMB, thanks a lot for your detailed reply, but I still have two follow-up questions.

  1. According to Eqn. (3), the objective function is the sum of per-cluster ratios, a1/d1 + a2/d2 + ..., which in my opinion is NOT equal to the ratio of sums, (a1 + a2 + ...)/(d1 + d2 + ...) (i.e. trace(S^T \tilde{A} S)/trace(S^T \tilde{D} S) as in Eqn. (6)), right? In other words, the order of summation and division is different, right?

  2. Additionally, are you sure that Q^T Q = I as stated in Eqn. (4) of your paper? It seems to conflict with Eqns. (14-16) of (Yu & Shi, 2003). Specifically, Q is analogous to Z in Eqn. (16) of (Yu & Shi, 2003), and it is clear that Z^T Z != I. So I am wondering whether I am overlooking some important fact?

Looking forward to your further explanations.

FilippoMB commented 3 years ago

Hi @longbowzhang ,

  1. The example you are making is different from the quadratic forms that appear in the paper. This simple example might help you see the equivalence of the two formulations in Eq. 3 and Eq. 6: https://colab.research.google.com/drive/1ACInzu-_tir94ORw3D1cjjKIKDY6eLoA?usp=sharing

  2. The orthogonality in Eq. 4 is a constraint of the optimization problem, and it is satisfied when the columns of Q are the eigenvectors of A. Indeed, since the eigenvectors of a symmetric matrix are orthonormal, Q^T Q = I.

Cheers :)

longbowzhang commented 3 years ago

Hi @FilippoMB

(1) I tried again and found that they are still not equivalent; they are just numerically close. This is my experiment: https://colab.research.google.com/drive/1ggO6YMwJjchqwlz0AfUjcBXmYC7kWIXb?usp=sharing Is there a bug that I fail to identify?
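For readers without access to the notebook, a minimal sketch of this kind of comparison might look as follows; the toy graph, the hard cluster assignment, and all variable names are made up for illustration:

```python
import numpy as np

n = 10
# ring graph on 10 nodes plus a couple of chords (made up for illustration)
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
A[0, 5] = A[5, 0] = 1.0
A[2, 7] = A[7, 2] = 1.0
D = np.diag(A.sum(axis=1))

# a hard assignment of the nodes to 3 clusters (one-hot rows)
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
S = np.eye(3)[labels]

# Eq. (3)-style objective: average of the per-cluster ratios
ratios = [(S[:, k] @ A @ S[:, k]) / (S[:, k] @ D @ S[:, k]) for k in range(3)]
mean_of_ratios = np.mean(ratios)

# Eq. (6)-style objective: ratio of the two traces
ratio_of_traces = np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)

print(mean_of_ratios, ratio_of_traces)  # similar values, but not identical
```

On this toy graph the two quantities are close but not equal: the ratio of traces is a volume-weighted average of the per-cluster ratios, while Eq. (3) treats every cluster equally.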

(2) I think I failed to express myself clearly. The following figure is from (Yu & Shi, 2003):

[figure from (Yu & Shi, 2003)]

Here X is the binary indicator matrix, equivalent to C in your Eqns. (3) and (4), and Z in the figure above is equivalent to Q in your Eqn. (4).

2.1 Up to this point, do we agree?

2.2 Apparently the constraints are different, right? One is Z^T D Z = I, while the other is Q^T Q = I. So are Z and Q the same object or not?

Best :)

ShichangZh commented 3 years ago

Hi @FilippoMB. Thank you for the great work. I really like the MinCut pooling idea, and I had exactly the same two questions as @longbowzhang while reading the paper. Then I found you discussing them here. I am really curious about which part I am missing.

@longbowzhang could you please open access to your experiment? It seems to be locked right now. I did some experiments on my end, and I found that equation (3) and equation (6) give slightly different values, though the difference is small.

Best

longbowzhang commented 3 years ago

Hi @ShichangZh

> could you please open access to your experiment? It seems to be locked right now.

Sorry, I am not that familiar with Colab. I have now updated the Colab link in my previous post. My experiment is quite simple: I just show that Eqn (3) is "first divide, then sum" while Eqn (6) is "first sum, then divide".

> equation (3) and equation (6) give slightly different values, though the difference is small.

In my opinion the results are numerically similar but not mathematically equivalent.

Actually, I am also curious whether I am misunderstanding or overlooking some important fact. Looking forward to @FilippoMB's further explanation as well.

FilippoMB commented 3 years ago

Dear @longbowzhang and @ShichangZh ,

The cluster assignment matrix S that we want to learn should enhance some properties of the data, described by a matrix A, while avoiding other properties, described by a second matrix B. This can be obtained by maximizing the term tr(S^T A S) while, at the same time, minimizing the penalty term tr(S^T B S). A natural way of formulating this objective is to consider the trace ratio tr(S^T A S)/tr(S^T B S).

In the case of the normalized cut, the matrix B is the matrix of vertex degrees D, which acts as a scale-normalization constraint. Notice that similar formulations can be found in other graph embedding algorithms, such as Linear Discriminant Analysis, Marginal Fisher Analysis, Local Discriminant Embedding, and so on.

Finding an S that maximizes the trace ratio tr(S^T A S)/tr(S^T D S) leads to an optimization problem that is generally non-convex and, therefore, does not admit a closed-form solution. For this reason, the trace ratio is often approximated by the inexact ratio trace problem: argmax_S tr[(S^T D S)^{-1}(S^T A S)]. Now we have to maximize the trace of a single matrix rather than a ratio of traces, and the optimization problem can be solved directly by the generalized eigenvalue decomposition A u_k = \lambda_k D u_k. This is exactly what spectral clustering does, as described in Eq. 3 of our paper.
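As a rough illustration (not the code of the paper), the relaxed problem can be solved with a generalized eigendecomposition, e.g. via SciPy; the toy graph and variable names below are made up for the example:

```python
import numpy as np
from scipy.linalg import eigh

# toy symmetric adjacency matrix and its degree matrix (made up for illustration)
A = np.array([[0., 1., 1., 0., 0.],
              [1., 0., 1., 0., 0.],
              [1., 1., 0., 1., 0.],
              [0., 0., 1., 0., 1.],
              [0., 0., 0., 1., 0.]])
D = np.diag(A.sum(axis=1))

# generalized eigenproblem A u = lambda D u; take the K leading eigenvectors
K = 2
eigvals, eigvecs = eigh(A, D)          # eigenvalues returned in ascending order
U = eigvecs[:, -K:]                    # relaxed, continuous cluster assignment

# value of the relaxed objective tr(U^T A U) / tr(U^T D U)
print(np.trace(U.T @ A @ U) / np.trace(U.T @ D @ U))
```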

However, in our case we are in a neural network setting, where we use gradient descent to optimize the parameters of the model. Therefore, we are not interested in a closed-form solution, and we very much prefer to optimize the original formulation of our objective, the trace ratio, rather than the ratio trace, which is an inexact approximation of it.
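Concretely, in a differentiable setting the (negative) trace ratio can be used directly as a loss term. Here is a minimal TensorFlow sketch of such a term; it is not the actual implementation in this repository, and the function and tensor names are made up:

```python
import tensorflow as tf

def cut_loss(S, A_norm, D_norm):
    """Negative trace ratio -tr(S^T \tilde{A} S) / tr(S^T \tilde{D} S), to be minimized.

    S:      (N, K) soft cluster assignments (e.g. softmax outputs)
    A_norm: (N, N) normalized adjacency matrix \tilde{A}
    D_norm: (N, N) degree matrix of \tilde{A}
    """
    num = tf.linalg.trace(tf.transpose(S) @ A_norm @ S)
    den = tf.linalg.trace(tf.transpose(S) @ D_norm @ S)
    return -num / den
```

In practice S would be the softmax output of the layer that predicts the soft cluster assignments, so the loss can be minimized by gradient descent together with the rest of the model.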

On top of this, there are a few other things to keep in mind:

About your second point, you are totally right: there was an error in Eq. 4! Thanks a lot for pointing it out 👍 We wanted to express the orthogonality constraint in Eq. 4 without the degree matrix, to highlight the connection with our loss function in Eq. 6, but we forgot to normalize Q accordingly. Here is the correct formulation:

[screenshot of the corrected Eq. 4]

Cheers :)

longbowzhang commented 3 years ago

Dear @FilippoMB, thanks a lot for your detailed explanation. Let me briefly summarize for anyone else who might be interested in this topic:

  1. Eq. (3) and Eq. (6) are not mathematically equivalent: Eq. (3) sums the per-cluster ratios, while Eq. (6) takes the ratio of the summed traces. Eq. (6) optimizes the trace-ratio objective directly, which is preferable in a gradient-based setting, whereas the spectral clustering solution of Eq. (3) relies on an inexact relaxation.
  2. Eq. (4) contained an error in the orthogonality constraint on Q, which has been corrected above.

I will therefore close this issue. All the best!