AIML-K / GNN_Survey

updating papers related to GNN

Graph neural network-inspired kernels for gaussian processes in semi-supervised learning #7

Open 2nazero opened 1 week ago

2nazero commented 1 week ago

Graph neural network-inspired kernels for gaussian processes in semi-supervised learning

@article{niu2023graph,
  title={Graph neural network-inspired kernels for gaussian processes in semi-supervised learning},
  author={Niu, Zehao and Anitescu, Mihai and Chen, Jie},
  journal={arXiv preprint arXiv:2302.05828},
  year={2023}
}
2nazero commented 2 days ago

Main Idea - The combination of GNNs and GPs

Found a relevant paper that explains why wide neural networks behave as GPs: Wide Neural Networks as Gaussian Processes: Lessons from Deep Equilibrium Models

GCN as GP

All proofs are given in the appendix of this paper!

Equation (1): GCN Layer Formula

$$ X^{(l)} = \phi\left( A X^{(l-1)} W^{(l)} + b^{(l)} \right), $$
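
To make equation (1) concrete, here is a minimal NumPy sketch of a single GCN layer; the ReLU activation, the row-normalized adjacency, and the toy shapes are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def gcn_layer(A, X, W, b, phi=lambda z: np.maximum(z, 0.0)):
    """One GCN layer: X^{(l)} = phi(A X^{(l-1)} W^{(l)} + b^{(l)}).

    A   : (N, N) normalized adjacency (assumed precomputed)
    X   : (N, d_in) node features from the previous layer
    W   : (d_in, d_out) layer weights, b : (d_out,) bias
    phi : elementwise nonlinearity (ReLU assumed here)
    """
    return phi(A @ X @ W + b)

# toy usage: 3 nodes, 2 input features, 4 hidden units
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])            # assumed row-normalized adjacency
X0 = rng.normal(size=(3, 2))
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
X1 = gcn_layer(A, X0, W1, b1)              # shape (3, 4)
```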

Equation (2): Element-wise GCN Transformation

$$ x_i^{(l)}(x) = \phi\left(z_i^{(l)}(x)\right), \quad z_i^{(l)}(x) = b_i^{(l)} + \sum_{v \in \mathcal{V}} A_{xv} \, y_i^{(l)}(v), $$

Equation (3): Covariance Calculation in GP Interpretation

$$ C^{(l)} = \mathbb{E}_{z_i^{(l)} \sim \mathcal{N}(0, K^{(l)})} \left[\phi(z_i^{(l)}) \phi(z_i^{(l)})^T \right], \quad l = 1, \ldots, L, $$

Equation (4): Covariance Update for the Next Layer

$$ K^{(l+1)} = \sigma_b^2 \, 1_{N \times N} + \sigma_w^2 \, A C^{(l)} A^T, \quad l = 0, \ldots, L - 1. $$
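
To see how equations (3) and (4) play out numerically, here is a hedged NumPy sketch that estimates the expectation in (3) by Monte Carlo sampling and then applies the update in (4). The paper works with closed-form limit expressions; the sampling approach, the ReLU activation, and the handling of the initial $K^{(0)}$ below are my own simplifications for illustration.

```python
import numpy as np

def gp_covariance_recursion(A, K0, L, sigma_b=1.0, sigma_w=1.0,
                            phi=lambda z: np.maximum(z, 0.0), n_samples=20000):
    """Iterate K^{(l+1)} = sigma_b^2 * 1_{NxN} + sigma_w^2 * A C^{(l)} A^T,
    where C^{(l)} = E_{z ~ N(0, K^{(l)})}[phi(z) phi(z)^T] is estimated by
    Monte Carlo (the paper uses analytic expressions instead of sampling)."""
    N = A.shape[0]
    K = K0.copy()
    rng = np.random.default_rng(0)
    jitter = 1e-8 * np.eye(N)                        # numerical stabilizer
    for _ in range(L):
        # draw samples z ~ N(0, K): each row of Z is one draw over the N nodes
        Z = rng.multivariate_normal(np.zeros(N), K + jitter, size=n_samples)
        C = (phi(Z).T @ phi(Z)) / n_samples          # MC estimate of C^{(l)}
        K = sigma_b**2 * np.ones((N, N)) + sigma_w**2 * A @ C @ A.T
    return K

# toy usage: 3-node graph, assumed linear-kernel initialization K^{(0)} = X X^T
A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])
X = np.random.default_rng(1).normal(size=(3, 2))
K_L = gp_covariance_recursion(A, X @ X.T, L=2)       # (3, 3) GP covariance
```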

2nazero commented 2 days ago

Scalable Computation through Low-Rank Approximation

Key Approach

Goal

2nazero commented 2 days ago

Composing Graph Neural Network-Inspired Kernels

2nazero commented 2 days ago

Experiments

Why was this experiment conducted?

The experiments aim to demonstrate the prediction performance of GP kernels derived by taking the infinite-width limit of GCNs and other GNN architectures. The primary goal is to show that these GPs are comparable to the corresponding GNNs in prediction accuracy while being significantly faster to compute.

Datasets

The experiments were conducted on multiple benchmark datasets, covering both classification and regression tasks:

Evaluation Setup

  1. Prediction Performance (GCN-Based Comparison):

    • The GP kernels used include:
      • GCNGP: Derived from the limiting case of GCN.
      • RBF: A standard squared-exponential kernel.
      • GGP: A Gaussian Process kernel from related literature.
    • Each of these kernels has a low-rank version (indicated by the suffix -X), which uses a Nyström approximation for scalability (a generic sketch of this construction follows after this list).
  2. Performance Metrics:

    • Classification: Micro-F1 score.
    • Regression: $R^2$ (coefficient of determination).
  3. Experiment 1: Comparing GCNs and GPs:

    • The results show that GCNGP performs comparably to, or slightly better than, GCN and other GP kernels. The low-rank version (GCNGP-X) achieves competitive results while being more computationally efficient.
  4. Experiment 2: Comparing with Other GNNs:

    • The study extends the comparison to popular GNN architectures such as GCNII, GIN, and GraphSAGE.
    • The results show that the GP versions of these architectures perform similarly to the GNNs, but with efficiency advantages on specific datasets such as PubMed and Reddit.
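
Since the -X variants rely on a Nyström approximation, here is a minimal, generic sketch of the Nyström low-rank construction; the RBF stand-in kernel, the random landmark selection, and the toy shapes are assumptions for illustration and do not reproduce the paper's exact GCNGP-X pipeline.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel, used here only as a stand-in kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def nystrom_features(X, landmarks, kernel=rbf_kernel, jitter=1e-8):
    """Nystrom approximation K ~ K_nm K_mm^{-1} K_mn, via explicit features
    Phi = K_nm K_mm^{-1/2} so that Phi @ Phi.T approximates K."""
    m = len(landmarks)
    K_mm = kernel(landmarks, landmarks) + jitter * np.eye(m)
    K_nm = kernel(X, landmarks)
    # inverse square root of K_mm via eigendecomposition
    w, V = np.linalg.eigh(K_mm)
    K_mm_inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.clip(w, jitter, None))) @ V.T
    return K_nm @ K_mm_inv_sqrt            # (N, m) low-rank feature map

# toy usage: m landmark points sampled from the data (assumed selection rule)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
m = 50
landmarks = X[rng.choice(len(X), size=m, replace=False)]
Phi = nystrom_features(X, landmarks)       # work with (N, m) features
                                           # instead of the full (N, N) kernel
```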

Running Time and Scalability

  1. Running Time Comparison:

    • The comparison shows that GCNGP-X is generally faster than GCN, with some datasets showing a speedup of one to two orders of magnitude. This is illustrated in Figure 1.
  2. Scalability Analysis:

    • Figure 2 demonstrates that the running time scales approximately linearly with respect to the graph size ($M + N$) for GCNGP-X and GCN. However, GCNGP shows cubic scaling, making the low-rank approximation more practical for large graphs.
  3. Analysis on the Depth:

    • The study examines the impact of increasing the number of layers. The results (Figure 3) indicate that both GCN and GCNII suffer from oversmoothing at larger depths. However, their GP counterparts (e.g., GCNGP) remain stable even for a depth as large as 12.
  4. Analysis on the Landmark Set:

    • The number of landmark nodes ($N_a$) affects the trade-off between approximation quality and running time. Figure 4 shows that using only $1/800$ of the training set as landmarks achieves accuracy comparable to GCN, while the computational cost remains much lower.

Results

Key Takeaway

The experiments show that GP kernels derived from GCNs can achieve comparable prediction performance to GNNs while being computationally more efficient. This demonstrates the practicality of the proposed GP approach for large-scale and deep graph-based tasks.

2nazero commented 2 days ago

I have reviewed this paper but found it challenging to fully comprehend the mathematical proofs and rigorous details presented in the equations. If anyone could provide insights or comments, I would greatly appreciate it :)

d-h-lee commented 2 days ago

> I have reviewed this paper but found it challenging to fully comprehend the mathematical proofs and rigorous details presented in the equations. If anyone could provide insights or comments, I would greatly appreciate it :)

to fully comprehend the mathematical proofs --> as you clearly mentioned in https://github.com/AIML-K/GNN_Survey/issues/7#issuecomment-2430977827, understanding the NN-GP equivalence will help you understand this. In particular, papers showing that infinitely wide NNs are equivalent to GPs will be useful.

Don't go to the first source (Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996. -- too statistical). I recommend https://arxiv.org/abs/1711.00165 or https://openreview.net/pdf?id=rkl4aESeUH .

Meanwhile, your decision to attend the GP workshop will pay off.