flyingtango / DiGCN

Implementation of DiGCN, NeurIPS 2020
MIT License

Differences in results between DiGCN and DGCN #1

Closed vymao closed 3 years ago

vymao commented 3 years ago

Hi,

I was wondering, what explains the large discrepancy in accuracy for benchmark methods (GCN/GAT, for example) between Table 1 in the DGCN paper and Table 2 in the DiGCN paper? The accuracy levels seem very different for the same datasets.

flyingtango commented 3 years ago

Hi Victor, the differences in the results come from two main factors. First, the experimental tasks in the two papers are different. Although DGCN uses digraph datasets, its inputs are still symmetric adjacency matrices, which is why the baselines in DGCN are close to the original results on undirected graphs. Our experiments in DiGCN, however, restrict the inputs to asymmetric adjacency matrices in order to measure performance on digraphs. Please see Sec 5.1 for the experimental task and Sec 6.1 for the analysis. Second, DGCN concatenates the second-order in- and out-matrices to obtain neighbour features, which can significantly increase the number of edges and cause OOM problems. Because of this structure, DGCN gets a larger receptive field than a normal GCN, and therefore a small boost.

DiGCN goes beyond DGCN not only in performance under more stringent experimental conditions but also in interpretability. DGCN uses simple symmetric matrices for first-order proximity in its Eq. (8) and does not explain why directed structural features can be obtained. This is why we improved substantially on DGCN and proposed DiGCN.
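
To make the input difference concrete, here is a small illustrative sketch (not taken from either paper's code) of a directed adjacency matrix versus its symmetrized counterpart; the symmetrization is what discards direction information:

```python
# Minimal sketch: the symmetric adjacency used by undirected baselines vs.
# the asymmetric adjacency that the DiGCN experiments keep.
import numpy as np
import scipy.sparse as sp

edges = np.array([[0, 1], [1, 2], [2, 0], [0, 3]])  # directed edge list (src, dst)
n = 4

row, col = edges[:, 0], edges[:, 1]
A_directed = sp.coo_matrix((np.ones(len(edges)), (row, col)), shape=(n, n)).tocsr()

# Symmetrization discards edge direction: A_sym[i, j] = 1 if either direction exists.
A_sym = ((A_directed + A_directed.T) > 0).astype(float)

print((A_directed != A_directed.T).nnz)  # > 0: direction information present
print((A_sym != A_sym.T).nnz)            # 0: direction information lost
```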

vymao commented 3 years ago

Thanks. I am also wondering:

  1. What is W in Equation 6 exactly? How does this differ from the trainable weight?
  2. How does the preprocessing of the data work? I took a look at the code but I am confused about how it (e.g., DIGCNConv.py) relates to Equation 6 in the paper. Are you just pre-computing the k-th order proximity matrices? If so, where is the rest of the digraph convolution?
flyingtango commented 3 years ago
  1. The W in Eq. 6 is the diagonalized weight matrix of P, which plays a role similar to D in Kipf's GCN. The trainable weight is the separate learnable parameter matrix that multiplies the node features in the same equation, analogous to the weight matrix in Kipf's GCN.
  2. Because we need to compute an eigenvalue decomposition, all adjacency matrices (both their computation and their normalization) are preprocessed; the different adjacency matrices are implemented in get_adj.py. DIGCNConv.py only implements the aggregation, which is the same operation as GCNConv. Since Inception is a mature structure, for convenience we only release the code for k=2 and the three-layer DiGCN model. Computing the second-order proximity matrix is in get_adj.py, and the convolution operation is in DiGCN_ib.py. You can easily increase k and the number of layers following the equation (a compressed sketch of how the branches fit together is shown below).
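
For illustration only, here is a compressed sketch of how such an Inception block can be wired when k=2: a Linear branch for k=0 plus two aggregation branches over precomputed, normalized first- and second-order matrices. The class names and the fusion (a simple sum) are illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn

class SparseGraphConv(nn.Module):
    """Aggregation only: X' = A_norm @ (X W), with A_norm a precomputed sparse tensor."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj_norm):
        return torch.sparse.mm(adj_norm, self.lin(x))

class InceptionBlockSketch(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_k0 = nn.Linear(in_dim, out_dim)        # k = 0: self features only, no adjacency
        self.conv_k1 = SparseGraphConv(in_dim, out_dim)  # k = 1: first-order proximity matrix
        self.conv_k2 = SparseGraphConv(in_dim, out_dim)  # k = 2: second-order proximity matrix

    def forward(self, x, adj1, adj2):
        # Branch outputs are fused here by a simple sum; the paper's fusion choice may differ.
        return self.lin_k0(x) + self.conv_k1(x, adj1) + self.conv_k2(x, adj2)
```
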
vymao commented 3 years ago

Thanks, some follow-ups:

  1. Is this a valid convolution? I did not see it explained anywhere.
  2. Within InceptionBlock, why is there a Linear layer, and why are there two DiGCN layers?

It would also be enormously helpful if you could document how to properly set up other datasets to run this model on, what to run, etc. I have some that I would be interested in testing (apart from the ones in the paper).

flyingtango commented 3 years ago
  1. For the graph conv part (spectral analysis), you can refer to Kipf's paper and the PyTorch Geometric implementation. Our paper doesn't make any change in the spectral part; please see Sec 2.3.
  2. The Linear layer is used in the Inception block for k=0; please refer to Eq. 6. When k=0 there is no adjacency matrix, so the node aggregates features only from itself.

Thanks for your suggestion. Since our implementation is based on PyTorch Geometric, any dataset that works with the PyG data loader can be used. For details, please refer to the PyG documentation. Meanwhile, I will update the README file to give instructions.
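
For reference, a minimal sketch (not repository code) of wrapping a custom graph into a torch_geometric.data.Data object, which is the format the PyG loaders expect:

```python
import torch
from torch_geometric.data import Data

x = torch.randn(4, 16)                          # node features: [num_nodes, num_features]
edge_index = torch.tensor([[0, 1, 2, 0],        # directed edges as a [2, num_edges] tensor
                           [1, 2, 0, 3]], dtype=torch.long)
y = torch.tensor([0, 1, 0, 1])                  # node labels

data = Data(x=x, edge_index=edge_index, y=y)
print(data)  # Data(x=[4, 16], edge_index=[2, 4], y=[4])
```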

vymao commented 3 years ago

I am familiar with the graph convolution paper, but there the convolution is motivated through the graph Fourier transform (eigenvector change of basis, then convolution, then inverse graph Fourier transform, etc.). I'm not sure how the convolution here relates. It seems that Formula 4 of Section 2.3 is implemented for the first-order proximity; is the convolution for k >= 2 an approximation of Formula 4?

Also, how should we set the PageRank teleport probability alpha as a hyperparameter? I will be working with strongly connected graphs, which the paper notes this method isn't ideal for, but I'm also wondering why.

And finally, do you minibatch the training? I didn't see this implemented and I am surprised that the training data can fit into GPU memory.

flyingtango commented 3 years ago
  1. Yes, you are right: Formula 4 of Section 2.3 is implemented for the first-order proximity when k=1. As mentioned in the paper, when k>=2 the kth-order proximity matrix P^(k) is symmetric due to the intersection operation, which means the corresponding graph is undirected, so there is no need to apply the digraph conv. However, after taking the intersection we still need to use W to normalize P.
  2. The teleport probability is set according to the situation. For example, in cases with strong connectivity, such as air pollution, alpha can be set larger. But for time-sequence or knowledge graphs, where edge direction needs to be strictly respected, a smaller alpha is appropriate. In addition, you can also refer to the setting of the teleport probability in this article. As for your work: assuming your graph is already strongly connected, and making sure it is aperiodic (by adding self-loops), you can directly solve for the Perron vector of the adjacency matrix and use that vector to normalize it. It is not necessary to use personalized PageRank to enhance the connectivity of the original graph, and you can skip straight to the next step of solving the Laplacian of the digraph (a rough sketch of this shortcut follows after this list). Our method in this paper is not very suitable for undirected and strongly connected graphs, due to the limitations of SVD. If your graph is still a digraph, then as far as I am concerned, why not try our approach? :)
  3. Yes, this implementation uses full-batch training, no different from the original GCN. This is a limitation of our approach: it requires loading the entire graph during training and causes OOM issues when there are too many nodes. You can apply minibatching using ClusterGCN.
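
As a rough NumPy sketch of the strongly-connected shortcut mentioned in point 2 (assuming the symmetric form used in Chung's digraph Laplacian; please check the paper for the exact normalization):

```python
# Add self-loops (aperiodicity), build the transition matrix P, get its Perron
# vector by power iteration, then form the symmetric normalized matrix that
# the digraph Laplacian construction is based on.
import numpy as np

def perron_normalized(A, iters=1000, tol=1e-10):
    n = A.shape[0]
    A = A + np.eye(n)                       # self-loops -> aperiodic
    P = A / A.sum(axis=1, keepdims=True)    # row-stochastic transition matrix

    phi = np.full(n, 1.0 / n)               # power iteration for the stationary
    for _ in range(iters):                  # (Perron) vector: phi P = phi
        phi_next = phi @ P
        if np.abs(phi_next - phi).sum() < tol:
            phi = phi_next
            break
        phi = phi_next

    Phi_sqrt = np.diag(np.sqrt(phi))
    Phi_inv_sqrt = np.diag(1.0 / np.sqrt(phi))
    # Symmetric normalized form; the digraph Laplacian would be I minus this.
    return 0.5 * (Phi_sqrt @ P @ Phi_inv_sqrt + Phi_inv_sqrt @ P.T @ Phi_sqrt)
```
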
vymao commented 3 years ago

In my graph, the strong connectedness comes from the fact that there exists a directed cycle for any path length, not necessarily just between two nodes, and there are no isolated nodes.

With regards to alpha, there is this statement in the paper:

Likewise, another work [26] also employs this idea to solve digraph problem. However, it is defined on the strongly connected digraphs, which is not universally applicable to any digraphs. Our method can easily generalize to it by α → 0.

Unless I am mistaken, did you mean to say that strongly connected graphs should have a smaller alpha? I would still consider using DiGCN (but with a particular choice of alpha) for a strongly connected graph I have. For one, it does not seem like the original spectral digraph convolution has been implemented in code. And for two, the kth-order proximity seems useful here.

Also, I have a PyTorch Geometric dataset saved in individual Data files (from which the out-of-memory DataLoader loads it into minibatches). Would you be able to incorporate this? As it stands, it seems that one needs a compressed Numpy file.

flyingtango commented 3 years ago
  1. As I answered before, a strongly connected graph doesn't need approximate personalized PageRank, which means you only need to calculate the Perron vector and normalize the adjacency matrix; there is no alpha in this process. Work [26] is defined on strongly connected digraphs, which suits your data, so you can refer to that work as well.
  2. The Inception structure is complementary to the original GCN and helps improve the overall model. The choice of k and the number of layers can be determined based on experimental results.
  3. I don't think minibatching can be applied directly to our model; you need to use ClusterGCN to chunk large adjacency matrices in order to use minibatches. I think this is a data preprocessing problem that you could handle easily :)
vymao commented 3 years ago

I know that, but I am still considering DiGCN for the following reasons:

  1. [26] does not have the kth order proximity metric or the inception blocks, which seems useful.
  2. [26] has not been implemented in code.
  3. It is stated in the paper that if alpha is 0, you recover [26].

Is this valid?

flyingtango commented 3 years ago

Yes, exactly. You can take alpha=0.
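
For illustration, assuming the standard PageRank-style teleport construction G = (1 - alpha) * P + (alpha / n) * J (check the paper for the exact form used), alpha = 0 simply leaves the plain transition matrix, i.e. the strongly connected setting of [26]:

```python
import numpy as np

def teleport_matrix(P, alpha):
    # (1 - alpha) * P plus uniform teleport; with alpha = 0 the teleport term vanishes.
    n = P.shape[0]
    return (1 - alpha) * P + alpha * np.ones((n, n)) / n

P = np.array([[0.0, 1.0], [1.0, 0.0]])
assert np.allclose(teleport_matrix(P, alpha=0.0), P)  # alpha = 0 recovers P
```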

vymao commented 3 years ago

Ok thank you. Also, do you add self-edges into the adjacency matrix before you apply the layers?

flyingtango commented 3 years ago

Yes, we add self-loops to guarantee the graph is aperiodic.

vymao commented 3 years ago

Also, isn't Cluster-GCN only available for undirected graphs?

flyingtango commented 3 years ago

Yes, Cluster-GCN is for undirected graphs. However, our digraph Laplacian is symmetric, which means you can apply Cluster-GCN to the Laplacian.
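
As a minimal sketch of that idea (assuming PyTorch Geometric >= 2.0, where ClusterData and ClusterLoader live in torch_geometric.loader, and METIS support is installed), you would partition the symmetric structure and iterate over subgraph batches; carrying the DiGCN-specific normalized matrices through the partitioning is left to you:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import ClusterData, ClusterLoader

# `edge_index_sym` should be the symmetric structure induced by the digraph
# Laplacian, not the raw directed edge list. Here: a toy 6-node undirected ring.
num_nodes = 6
src = torch.arange(num_nodes)
dst = (src + 1) % num_nodes
edge_index_sym = torch.cat([torch.stack([src, dst]), torch.stack([dst, src])], dim=1)

data = Data(x=torch.randn(num_nodes, 16), edge_index=edge_index_sym)

cluster_data = ClusterData(data, num_parts=2)          # METIS partitioning
loader = ClusterLoader(cluster_data, batch_size=1, shuffle=True)

for batch in loader:
    pass  # run the model on each subgraph batch here
```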