Graph Convolutional Networks.

NorbertZheng commented 2 years ago

NorbertZheng commented 2 years ago

Multi-layer Graph Convolutional Network (GCN) with first-order filters. gcn_web

NorbertZheng commented 2 years ago

Related Reference

Bruna J, Zaremba W, Szlam A, et al. Spectral networks and locally connected networks on graphs.
Henaff M, Bruna J, LeCun Y. Deep convolutional networks on graph-structured data.
Duvenaud D K, Maclaurin D, Iparraguirre J, et al. Convolutional networks on graphs for learning molecular fingerprints.
Li Y, Tarlow D, Brockschmidt M, et al. Gated graph sequence neural networks.
Defferrard M, Bresson X, Vandergheynst P. Convolutional neural networks on graphs with fast localized spectral filtering.
Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks.
Ferenc Huszar. How powerful are Graph Convolutions.
Jain A, Zamir A R, Savarese S, et al. Structural-rnn: Deep learning on spatio-temporal graphs.
Brandes U, Delling D, Gaertler M, et al. On modularity clustering.
Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social representations.
Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks.

NorbertZheng commented 2 years ago

Overview

Many important real-world datasets come in the form of graphs or networks: social networks, knowledge graphs, protein-interaction networks, the World Wide Web, etc. (just to name a few). Yet, until recently, very little attention has been devoted to

the generalization of neural network models to such structured datasets.

In the last couple of years, a number of papers re-visited this problem of generalizing neural networks to work on arbitrarily structured graphs (Bruna et al., ICLR 2014; Henaff et al., 2015; Duvenaud et al., NIPS 2015; Li et al., ICLR 2016; Defferrard et al., NIPS 2016; Kipf & Welling, ICLR 2017), some of them now achieving very promising results in domains that have previously been dominated by, e.g.,

kernel-based methods,
graph-based regularization techniques,
...

In this post, I will give a brief overview of recent developments in this field and point out the strengths and drawbacks of various approaches. The discussion here will mainly focus on two recent papers:

Kipf & Welling (ICLR 2017), Semi-Supervised Classification with Graph Convolutional Networks.
Defferrard et al. (NIPS 2016), Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering.

and a review/discussion post by Ferenc Huszar: How powerful are Graph Convolutions? that discusses some limitations of these kinds of models. I wrote a short comment on Ferenc's review at the very end of this post.

NorbertZheng commented 2 years ago

Outline

Short introduction to neural network models on graphs.
Spectral graph convolutions and Graph Convolution Networks (GCNs).
Demo: Graph embeddings with a simple 1st-order GCN model.
GCNs as a differentiable generalization of the Weisfeiler-Lehman algorithm.

NorbertZheng commented 2 years ago

Recent literature

Generalizing well-established neural models like RNNs or CNNs to work on arbitrarily structured graphs is a challenging problem.

Some recent papers introduce problem-specific specialized architectures (e.g. Duvenaud et al., NIPS 2015; Li et al., ICLR 2016; Jain et al., CVPR 2016),
Others make use of graph convolutions known from spectral graph theory (Bruna et al., ICLR 2014; Henaff et al., 2015) to define parameterized filters that are used in a multi-layer neural network model, akin to "classical" CNNs that we know and love.

A spectral graph convolution is defined as

the multiplication of a signal with a filter in the Fourier space of a graph.

A graph Fourier transform is defined as the multiplication of a graph signal $X$ (i.e. feature vectors for every node) with the eigenvector matrix $U$ of the graph Laplacian $L$. The (normalized) graph Laplacian can be easily computed from the symmetrically normalized graph adjacency matrix $\tilde{A}$.

$$ L=I-\tilde{A} $$

NorbertZheng commented 2 years ago

Using a spectral approach comes at a price:

Filters have to be defined in Fourier space and a graph Fourier transform is expensive to compute.

It requires the multiplication of node features with the eigenvector matrix of the graph Laplacian, which is $O(N^{2})$ operation for a graph with $N$ nodes; computing the eigenvector matrix in the first place is even more expensive.

More recent work focuses on bridging the gap between fast heuristics and the slow, but somewhat more principled, spectral approach. Defferrard et al. (NIPS 2016) approximate smooth filters in the spectral domain using Chebyshev polynomials with free parameters that are learned in a neural network-like model. They achieve convincing results on regular domains (like MNIST), closely approaching those of a simple 2D CNN model.

In Kipf & Welling (ICLR 2017), we take a somewhat similar approach and start from the framework of spectral graph convolutions, yet introduce simplifications (we will get to those later in the post) that in many cases allow both for significantly faster training times and higher predictive accuracy, reaching state-of-the-art classification results on a number of benchmark graph datasets.

NorbertZheng commented 2 years ago

Definitions

Currently, most neural network models have a somewhat universal architecture in common. I will refer to these models as Graph Convolution Networks (GCNs); convolutional, because filter parameters are typically shared over all locations in the graph (or a subset thereof as in Duvenaud et al., NIPS 2015).

For these models, the goal is then to learn a function of signals/features on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ which takes as input:

A feature description $x_i$ for every node $i$; summarized in a $N \times D$ feature matrix $X$ ($N$: number of nodes, $D$: number of input features).
A representative description of the graph structure in matrix form; typically in the form of an adjacency $A$ (or some function thereof).

and produces a node-level output $Z$ (an $N \times F$ feature matrix, where $F$ is the number of output features per node). Graph-level outputs can be modeled by introducing some form of pooling operation (see, e.g. Duvenaud et al., NIPS 2015).

Every neural network layer can then be written as a non-linear function

$$ H^{(l+1)}=f(H^{(l)},A) $$

with $H^{(0)}=X$ and $H^{(L)}=Z$ (or $z$ for graph-level outputs), $L$ being the number of layers. The specific models then differ only in how $f(\cdot,\cdot)$ is chosen and parameterized.

NorbertZheng commented 2 years ago

TEM can be viewed as 1-layer GCN,

$$ Z=f(X,A) $$

, where $X$ is the sensory observation, and $A$ is an action-specified adjacency matrix (which is a component of the whole adjacency matrix).

NorbertZheng commented 2 years ago

A simple example

As an example, let's consider the following very simple form of a layer-wise propagation rule:

$$ f(H^{(l)},A)=\sigma\left(AH^{(l)}W^{(l)}\right) $$

, where $W^{(l)}$ is a weight matrix for the $l$-th neural network layer and $\sigma(\cdot)$ is a non-linear activation function like the RELU. Despite its simplicity, this model is already quite powerful (we'll come to that in a moment).

But first, let us address two limitations of this simple model:

The multiplication with $A$ means that, for every node, we sum up all the feature vectors of all neighboring nodes but not the node itself (unless there are self-loops in the graph, just like TEM, but we use an action-specified adjacency matrix). We can "fix" this by enforcing self-loops in the graph: we simply add the identity matrix to $A$.
$A$ is typically not normalized and therefore the multiplication with $A$ will completely change the scale of the feature vectors (we can understand that by looking at the eigenvalues of $A$). Normalizing $A$ such that all rows sum to one, i.e. $D^{-1}A$, where $D$ is the diagonal node degree matrix, gets rid of this problem. TEM already solves this problem, the transition matrix is normalized in the _env class.

NorbertZheng commented 2 years ago

Multiplying with $D^{-1}A$ now corresponds to taking the average of neighboring node features. In practice, dynamics get more interesting when we use a symmetric normalization, i.e. $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ (as this no longer amounts to mere averaging of neighboring nodes). Combining these two tricks, we essentially arrive at the propagation rule introduced in Kipf & Welling (ICLR 2017):

$$ f(H^{(l)},A)=\sigma\left(\hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right) $$

, with $\hat{A}=A+I$, where $I$ is the identity matrix and $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$.

In the next section, we will take a closer look at how this type of model operates on a very simple example graph: Zachary's karate club network (make sure to check out the Wikipedia article!).

NorbertZheng commented 2 years ago

Embedding the karate club network

Karate club graph, colors denote communities obtained via modularity-based clustering (Brandes et al., 2008).

Let's take a look at how our simple GCN model (see previous section or Kipf & Welling, ICLR 2017) works on a well-known graph dataset: Zachary's karate club network (see Figure above).

We take a 3-layer GCN with randomly initialized weights. Now, even before training the weights, we simply insert the adjacency matrix of the graph and $X=I$ (i.e. the identity matrix, as we don't have any node features) into the model. This is exactly what TEM defines the sensory observations, no node features at all! The 3-layer GCN now performs three propagation steps during the forward pass and effectively convolves the 3rd-order neighborhood of every node (all nodes up to 3 "hops" away). In TEM, there is only 1-layer action-specified RNN (although there is no direct one-hot mapping between the node in RNN and the node in abstract structure, we can still assume there is a distributed mapping), so TEM can only capture 1st-order neighborhood of every node.

Remarkably, the model produces an embedding of these nodes that closely resembles the community-structure of the graph (see Figure below). Remember that we have initialized the weights completely at random and have not yet performed any training updates (so far)!

GCN embedding (with random weights) for nodes in the karate club network.

This might seem somewhat surprising. A recent paper on a model called DeepWalk (Perozzi et al., KDD 2014) showed that they can learn a very similar embedding in a complicated unsupervised training procedure. How is it possible to get such an embedding more or less "for free" using our simple untrained GCN model?

NorbertZheng commented 2 years ago

We can shed some light on this by interpreting the GCN model as a generalized, differentiable version of the well-known Weisfeiler-Lehman algorithm on graphs. The (1-dimensional) Weisfeiler-Lehman algorithm works as follows (I'm simplifying things here. For an extensive mathematical discussion of the Weisfeiler-Lehman algorithm, have a look at the paper by Douglas (2011)):

For all nodes $v_i \in \mathcal{G}$:

Get features { $h{v{j}}$ } of neighboring nodes { $v_{j}$ }. Node features for the Weisfeiler-Lehman algorithm are typically chosen as scalar integers and are often referred to as colors.
Update node feature $h{v{j}} \leftarrow hash\left(\Sigma{j} h{v_{j}}\right)$, where $hash(\cdot)$ is (ideally) an injective hash function.

Repeat for $k$ steps or until convergence.

In practice, the Weisfeiler-Lehman algorithm assigns a unique set of features for most graphs. This means that every node is assigned a feature that uniquely describes its role in the graph. Exceptions are highly regular graphs like grids, chains, etc. For most irregular graphs, this feature assignment can be used as a check for graph isomorphism (i.e. whether two graphs are identical, up to a permutation of the nodes).

NorbertZheng commented 2 years ago

Going back to our Graph Convolutional layer-wise propagation rule (now in vector form):

$$ h{v{i}}^{(l+1)}=\sigma\left(\Sigma{j}\frac{1}{c{ij}}h{v{j}}^{(l)}W^{(l)}\right) $$

, where $j$ indexes the neighboring nodes of $v{j}$. $c{ij}$ is a normalization constant for the edge $(v{i},v{j})$ which originates from using the symmetrically normalized adjacency matrix $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ in our GCN model. We now see that this propagation rule can be interpreted as

a differentiable and parameterized (with $W^{(l)}$) variant of the hash function used in the original Weisfeiler-Lehman algorithm.

If we now choose an appropriate non-linearity and initialize the random weight matrix such that it is orthogonal (or e.g. using the initialization from Glorot & Bengio, AISTATS 2010), this update rule becomes stable in practice (also thanks to the normalization with $c_{ij}$). And we make the remarkable observation that we get meaningful smooth embeddings where we can interpret distance as (dis-)similarity of local graph structures!

NorbertZheng commented 2 years ago

Semi-supervised learning

Since everything in our model is differentiable and parameterized, we can add some labels, train the model and observe how the embeddings react. We can use the semi-supervised learning algorithm for GCNs introduced in Kipf & Welling (ICLR 2017). We simply label one node per class/community (highlighted nodes in the video below) and start training for a couple of iterations. As we don't anneal the learning rate in this example, things become quite "jiggly" once the model has converged to a good solution.

Semi-supervised classification with GCNs: Latent space dynamics for 300 training iterations with a single label per class. Labeled nodes are highlighted. video

Note that the model directly produces a 2-dimensional latent space which we can immediately visualize. We observe that the 3-layer GCN model manages to linearly separate the communities, given only one labeled example per class. This is a somewhat remarkable result, given that the model received no feature description of the nodes.

At the same time, initial node features could be provided, which is exactly what we do in the experiments described in our paper (Kipf & Welling, ICLR 2017) to achieve state-of-the-art classification results on a number of graph datasets.

NorbertZheng commented 2 years ago

Conclusion

Research on this topic is just getting started. The past several months have seen exciting developments, but we have probably only scratched the surface of these types of models so far. It remains to be seen how neural networks on graphs can be further taylored to specific types of problems, like, e.g., learning on directed or relational graphs, and how one can use learned graph embeddings for further tasks down the line, etc. This list is by no means exhaustive and I expect further interesting applications and extensions to pop up in the near future. Let me know in the comments below if you have some exciting ideas or questions to share!

NorbertZheng commented 2 years ago

Source Code

We have released the code for Graph Convolutional Networks on GitHub.

NorbertZheng commented 2 years ago

How powerful are Graph Convolutions? (review of Kipf & Welling, 2016)

NorbertZheng commented 2 years ago

Overview

This post is about a paper that has just come out recently on practical generalizations of convolutional layers to graphs:

Thomas N. Kipf and Max Welling (2016) Semi-Supervised Classification with Graph Convolutional Networks.

Along the way I found this earlier, related paper:

Defferrard, Bresson and Vandergheynst (NIPS 2016) Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering.

This post is mainly a review of (Kipf and Welling, 2016). The paper is nice to read, and while I like the general idea, I feel like the approximations made in the paper are too limiting and severely hurt the generality of the models we can build. This post explains why. Since writing this, Thomas also wrote a blog post addressing some of my comments, which I highly recommend.

NorbertZheng commented 2 years ago

Summary of this post

Graph convolutions are generalisation of convolutions, and easiest to define in spectral domain.
General Fourier transform scales poorly with size of data so we need relaxations.
(Kipf and Welling) use first order approximation in Fourier-domain to obtain an efficient linear-time graph-CNNs.
I illustrate here what this first-order approximation amounts to on a 2D lattice one would normally use for image processing, where actual spatial convolutions are easy to compute.
In this application the modelling power of the proposed graph convolutional networks is severely impoverished, due to the first-order and other approximations made.

NorbertZheng commented 2 years ago

The general graph-based learning problem

The graph learning problem is formulated as follows:

We are given a set of nodes, each with some observed numeric attributes $x_i$.
For each node, we'd like to predict an output or label $y_i$. We observe these labels for some, but not all, of the nodes.
We are also given a set of weighted edges, summarized by an adjacency matrix $A$. The main assumption is that when predicting the output $y_i$ for node $i$, the attributes ( $x$ ) and connectivity ( $g$ ) of nearby nodes provide useful side information or additional context.

Kipf & Welling use graph convolutional neural networks to solve this problem. A good way to imagine what's happening is to consider a neural network that receives

input features $x_j$ from all nodes $j$ in the local neighborhood around a node $i$

, and outputs an estimate of the associated label $y_i$. The information from the local neighborhood gets combined over the layers via a concept of graph convolutions.

The deeper the network, the larger the local neighborhood - you can think of it as the generalization of the receptive field of a neuron in a normal CNN. This network is applied convolutionally across the entire graph, always receiving features from the relevant neighborhood around each node.

NorbertZheng commented 2 years ago

How are convolutions on a graph defined?

Usually, when we talk about convolutions in machine learning we consider time series, 2D images, occasionally 3D tensors. You can understand these as special cases of the graph learning problem, where the graph is a regular line, 2D square or 3D cube lattice

with regular connectivity between neighboring pixels or tensor entries.

These are all examples or regular graphs where all nodes have the same neighborhood connectivity patterns. It's not immediately obvious/intuitive how to generalize the concept of convolutions to arbitrary graphs that can have nodes

with very different local connectivity patterns or different degrees.

To generalize the concept, we need to adopt a spectral view of convolutions, and consider what convolutions are in the Fourier domain:

simple pointwise multiplication of the Fourier transform of a signal.

Fourier transforms do generalize to graphs, therefore we can define a general concept of graph convolutions as pointwise multiplication of the spectra of signals in the Fourier domain.

This, however, is a very impractical view: computing the spectrum of signal over general graphs involves matrix diagonalization, which generally has cubic complexity in the number of nodes. Computing exact convolutions over graphs is thus computationally intensive.

NorbertZheng commented 2 years ago

Approximate graph convolutions

One way to make graph convolutions work is by limiting the convolution kernels under consideration. For example, if the conv. kernel can be expressed or approximated using Chebyshev polynomials of eigenvalues in the spectral domain, computing the convolution becomes easier. This is the approach proposed earlier this year by Defferrard et al., NIPS 2016.

Like Taylor expansion, this Chebyshev approximation can be made arbitrarily accurate by making $k$ higher.

Kipf & Welling also use this trick, but go even further and only use a $1^{st}$ order approximation. In the Fourier domain, this restricts convolutions to kernels whose spectrum is an affine function of eigenvalues.

This makes computations pretty fast and easy, and the authors show competitive results on a bunch of graph-learning datasets. But what is really the price we pay when we use this first-order approximation? This is what I'll try to illustrate in this post.

NorbertZheng commented 2 years ago

Special case: dense prediction in computer vision

I think it's intuitive to consider the behavior of the algorithm in a well-studied special case: dense prediction in image processing, such as binary segmentation. Here

the node values $x_i$ are RGB, YUV or for simplicity just grayscale pixel intensities.
the output per node $y_i$ can be a binary or discrete label in semantic segmentation or pixel intensities in image denoising or stylization.
the graph is defined over a regular 2D square lattice of 8-connected pixels as shown below.
let's assume the graph has 4-way rotational symmetry, so all horizontal and vertical edges share the same weight $\alpha$ and all diagonal edges have weight $\beta$.
let's further assume the lattice is infinitely large just so we don't have to worry about what happens at borders.

In this application, the graph convolution network model from Kipf & Welling equates to a vanilla multi-layer convolution neural network, with only convolutional layers, no pooling or anything else. So far, this is good news, as these CNNs can actually solve a lot of dense prediction problems just fine, even though additional operations such as pooling-deconvolution generally can help a lot, such as in hourglass-shaped models.

NorbertZheng commented 2 years ago

However, the approximation employed in the Kipf & Welling model severely restricts the convolution filters available to us. At all levels in this network, the filters are limited to $3 \times 3$ in size (which in itself is still OK) and are also essentially

fixed to be the same kernel across all layers and all units in entire network,

up to a constant multiplier!

If we use the two-parameter approximation, the kernel basically becomes a center-surround pattern like the one pictured above, mathematically, something like this:

$$ \left[ \begin{array}{c} \theta{1}c{1} & \theta{1}c{2} & \theta{1}c{1} \ \theta{1}c{2} & \theta{0}+\theta{1}c{3} & \theta{1}c{2} \ \theta{1}c{1} & \theta{1}c{2} & \theta{1}c_{1} \ \end{array} \right] $$

, where $c{1}$, $c{2}$ and $c{3}$ are fixed constants that depend on the graph weight parameters $\alpha$ and $\beta$ only. The only trainable parameters are $\theta{0}$ and $\theta{1}$, and in the final version, the authors even further fix $\theta{0}=-\theta{1}$ so that the kernel only has a single scalar trainable parameter - essentially a scalar multiplier. Details on how $c{1}$, $c{2}$ and $c{3}$ are computed aside, this model is in fact very, very limited.

Imagine trying to solve something like image segmentation with a CNN where only a particular, fixed centre-surround receptive field is allowed - pretty much impossible. I don't think it's even possible to learn edge-detectors, or Gabor-like filters, or anything sophisticated, even with a multilayer network and lots of nonlinearities.

NorbertZheng commented 2 years ago

Summary

I'm not saying that the model proposed by Kipf & Welling cannot work well in some practical situations, indeed they show that it achieves competitive results in a number of benchmark problems, Also, their approach applies generally to graphs, of which regular 2D-3D lattices are a very unlikely special case, so it is to be expected that

models developed for a more general problem will have less power.

However, the 2D lattice example highlights, the generality and flexibility of this model is, at least in its present form, seriously lower than the CNNs we are used to in image processing. CNNs are a kind of super-weapon, a dependable workhorse when you need to solve pretty much any kind of problem. When you read graph convolutional networks it sounds like we now have a similarly powerful mega weapon to deal with graphs. Unfortunately, this may not be the case yet, but these papers certainly point in that very interesting direction.

From the perspective of generality, the more general treatment of Defferrard et al (2016) looks more promising, even with the added computational burden of including higher order Chebyshev polynomials in the approximation.

NorbertZheng commented 2 years ago

The issue with regular graphs

In the following, I will briefly comment on the statements made in How powerful are Graph Convolutions?, a recent blog post by Ferenc Huszar that provides a slightly negative view on some of the models discussed here.

Ferenc considers the special case of regular graphs. He correctly points out that Graph Convolutional Networks (as introduced in this blog post) reduce to rather trivial operations on regular graphs when compared to models that are specifically designed for this domain (like "classical" 2D CNNs for images).

It is indeed important to note that current graph neural network models that apply to arbitrarily structured graphs typically share some form of shortcoming when applied to regular graphs, like

grids,
chains,
fully-connected graphs
...

A localized spectral treatment (like in Defferrard et al., NIPS 2016), for example, reduces to rotationally symmetric filters and can never imitate the operation of a "classical" 2D CNN on a grid (exluding border-effects). In the same way, the Weisfeiler-Lehman algorithm will not converge on regular graphs. What this tells us, is that we should probably look beyond regular grids when trying to evaluate the usefulness of a specific graph neural network model, as there are specific trade-offs that have to be made when designing such models for arbitary graphs (yet it is of course important to make people aware of these trade-offs) - that is, unless we can come up with a universally powerful model at some point, of course.

NorbertZheng / read-papers