This blog is based on the paper A Generalization of Transformer Networks to Graphs, co-authored with Xavier Bresson and presented at the 2021 AAAI Workshop on Deep Learning on Graphs: Methods and Applications (DLG-AAAI’21).
We present Graph Transformer, a transformer neural network that can operate on arbitrary graphs.
Let’s start with the two keywords, Transformers and Graphs, for background.
Transformer-based [1] neural networks are the most successful architectures for representation learning in Natural Language Processing (NLP).
At the core of Transformers lies the multi-head attention mechanism, represented by the following formula:
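Following the notation of the paper, the update for word $i$ at layer $\ell$ can be written as:

$$\hat{h}_i^{\ell+1} = O^\ell \, \Big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{S}} w_{ij}^{k,\ell} \, V^{k,\ell} h_j^\ell \Big), \qquad w_{ij}^{k,\ell} = \mathrm{softmax}_j \Big( \frac{Q^{k,\ell} h_i^\ell \cdot K^{k,\ell} h_j^\ell}{\sqrt{d_k}} \Big)$$

where $\mathcal{S}$ is the set of words in the sentence, $H$ is the number of attention heads, $Q^{k,\ell}, K^{k,\ell}, V^{k,\ell}$ are the query, key, and value projections of head $k$, $d_k$ is the per-head dimension, $O^\ell$ is the output projection, and $\Vert$ denotes concatenation over heads.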
Fig: The feature update equation for word $i$ using the multi-head attention mechanism; the symbols carry their usual meanings.
Using multi-head attention, a word attends to every other word in a sentence and combines the received information to generate its abstract feature representation.
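As a minimal sketch of this mechanism (NumPy; the weight names `Wq`, `Wk`, `Wv`, `Wo` are illustrative, not taken from the paper’s code), full multi-head self-attention over a sentence of $n$ word features can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(h, Wq, Wk, Wv, Wo, num_heads):
    """Full (all-pairs) multi-head self-attention over word features h: (n, d)."""
    n, d = h.shape
    dk = d // num_heads
    Q = (h @ Wq).reshape(n, num_heads, dk)
    K = (h @ Wk).reshape(n, num_heads, dk)
    V = (h @ Wv).reshape(n, num_heads, dk)
    # attention scores per head: (heads, n, n)
    scores = np.einsum('ihd,jhd->hij', Q, K) / np.sqrt(dk)
    w = softmax(scores, axis=-1)           # each word attends to every word
    out = np.einsum('hij,jhd->ihd', w, V)  # weighted sum of values per head
    return out.reshape(n, d) @ Wo          # concatenate heads, project out
```

Note how nothing in this computation depends on the order of the rows of `h`; this permutation invariance is what motivates the positional encodings discussed below.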
Graphs are ubiquitous data structures. There is a wide range of application domains where datasets can be represented as graphs. For example, molecular graphs in chemistry, interactions among particles in physics, drug protein interactions in medicine, user and their social and commercial connections in social media, problems in combinatorial optimization, etc.
Fig: Representative domains with graph-structured datasets. Link
For learning on graphs, graph neural networks (GNNs) have emerged as the most powerful tool in deep learning.
In short, GNNs consist of several parameterized layers; each layer takes in a graph with node (and edge) features and builds abstract feature representations of nodes (and edges).
The so-generated features are then passed to downstream classification layers, usually MLPs, and the target property is predicted.
Now, the target is to generalize transformer neural networks to graphs so that they can learn on arbitrary graph-structured datasets.
To proceed with the objective, we focus on extending the key design principles of Transformers from NLP to graphs in general.
We find that attention using graph sparsity and positional encodings are two key design aspects for the generalization of transformers to arbitrary graphs.
Now, we discuss these from the contexts of both NLP and graphs to make the proposed extensions clear for Graph Transformers.
In NLP, Transformers consider full attention while building feature representations for words. That is, a transformer treats a sentence as a fully connected graph of words. This choice of full attention can be justified for two reasons:
First, it is difficult to find meaningful sparse interactions or connections among the words in a sentence. For instance, the dependency of a word in a sentence on another word can vary with context, user’s perspective, and application at hand. There can be numerous plausible ground truth connections among words in a sentence and therefore, text datasets of sentences do not often have explicit word interactions available. It thereby makes sense to perform full attention and let the model decide how the words depend on others.
Second, the so-interpreted fully connected graph in NLP often has less than tens or hundreds of nodes (i.e., sentences are often less than tens or hundreds of words long). On graphs of this size, attention over every node is computationally feasible in memory and time.
Fig: Full attention v/s attention restricted to local neighborhood.
However, actual graph datasets have arbitrary connectivity structure depending on the application domain, with node counts ranging up to millions, or even billions. The available structure presents us with a rich source of information to exploit while learning in the neural network, whereas the node counts make it practically impossible to operate on a fully connected graph for such datasets.
In fact, the local information aggregation is at the core of GNNs, indicative of the fact that sparsity is a good inductive bias for generalization.
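This sparsity bias can be sketched by masking the attention scores with the graph adjacency, so a node attends only to its neighbors (a single-head NumPy illustration, not the paper’s implementation; it assumes each node has at least one neighbor, e.g. via self-loops):

```python
import numpy as np

def neighborhood_attention(h, A, Wq, Wk, Wv):
    """Single-head attention where node i attends only to neighbors j with A[i, j] > 0.
    h: (n, d) node features, A: (n, n) adjacency (assumes every row has an entry)."""
    n, d = h.shape
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(A > 0, scores, -np.inf)  # mask out non-edges before softmax
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = e / e.sum(axis=1, keepdims=True)       # softmax over neighbors only
    return w @ V
```

Masking with `-np.inf` before the softmax gives non-edges exactly zero attention weight, so only the local neighborhood contributes to each node’s update.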
The attention mechanism in the Transformer is invariant to the ordering of the nodes: it has no notion of where in the sequence (or the sentence) a word is located. This means that the Transformer regards the input as a multi-set of words rather than a sequence of words, as illustrated by the following comparison:
Fig: For the multi-head attention mechanism, the input can be interpreted to be a multi-set of words.
That would mean losing some information about the ordering of the words, wouldn’t it?
To avoid this and make the Transformer aware of the sequential information, some kind of positional encoding is necessary. The original transformer by Vaswani et al. [1] uses a sinusoidal positional encoding that is added to each word’s feature vector at the input. This helps encode the prevalent (sequential) relationships among the words into the model.
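The sinusoidal encoding of [1] can be computed as follows (NumPy sketch; assumes an even feature dimension):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings of Vaswani et al. [1].
    Even feature indices get sines, odd indices get cosines (d_model assumed even)."""
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1) word positions
    i = np.arange(0, d_model, 2)[None, :]     # even feature indices
    angles = pos / (10000 ** (i / d_model))   # geometric range of wavelengths
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added to word embeddings at the input, e.g. x = embed(tokens) + sinusoidal_pe(n, d)
```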
We extend this critical design block of positional information encoding to Graph Transformer. In fact, a line of research in GNNs [6,7,8] has recently shown that injecting positional information into node features improves the expressivity and empirical performance of GNNs.
We, therefore, leverage the success of the recent works on positional information in GNNs and use Laplacian Positional Encodings [8] in Graph Transformer. We use precomputed Laplacian eigenvectors [9], added to the node features before the first layer, similar to how positional encodings are added in the original Transformer [1].
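A minimal sketch of this precomputation (NumPy; a dense eigendecomposition for illustration, whereas practical implementations use sparse solvers) takes the $k$ smallest non-trivial eigenvectors of the symmetric normalized Laplacian as node positional features:

```python
import numpy as np

def laplacian_pe(A, k):
    """k smallest non-trivial eigenvectors of L = I - D^{-1/2} A D^{-1/2},
    used as node positional encodings. A: (n, n) symmetric adjacency matrix."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    # skip the trivial first eigenvector (constant on a connected graph)
    return eigvecs[:, 1:k + 1]
```

Since eigenvectors are defined only up to sign, the paper randomly flips the sign of these encodings during training.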
Fig: Positional Encodings (PEs) in original Transformer v/s proposed Graph Transformer.
Hence, sparse graph structure during attention and positional encodings at the inputs are the two important things we consider while generalizing transformers to arbitrary graphs.
We now present the proposed architecture: the Graph Transformer Layer and the Graph Transformer Layer with edge features. The schematic diagram of a layer, shown below, consists of the major components:
Compared to the standard transformer of Vaswani et al. [1], the key differences (or extensions) that generalize the transformer to graphs, resulting in a Graph Transformer, are:
This concludes the description of the proposed Graph Transformer. We refer to the paper for the complete layer update equations.
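Putting the pieces together, one layer (without edge features) can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the paper’s implementation: the normalization is a batch-norm-style statistic over nodes without learned affine parameters, and the weight names are hypothetical.

```python
import numpy as np

def _norm(x, eps=1e-5):
    # batch-norm-style normalization over the node dimension (no learned affine)
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def graph_transformer_layer(h, A, Wq, Wk, Wv, Wo, W1, W2, num_heads):
    """One simplified Graph Transformer layer (no edge features):
    neighborhood-masked attention -> residual -> norm -> FFN -> residual -> norm.
    h: (n, d) node features; A: (n, n) adjacency (every row has an entry)."""
    n, d = h.shape
    dk = d // num_heads
    Q = (h @ Wq).reshape(n, num_heads, dk)
    K = (h @ Wk).reshape(n, num_heads, dk)
    V = (h @ Wv).reshape(n, num_heads, dk)
    scores = np.einsum('ihd,jhd->hij', Q, K) / np.sqrt(dk)
    scores = np.where(A[None] > 0, scores, -np.inf)   # attend only along edges
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    attn = np.einsum('hij,jhd->ihd', w, V).reshape(n, d) @ Wo
    h = _norm(h + attn)                               # residual + norm
    ffn = np.maximum(h @ W1, 0) @ W2                  # two-layer ReLU FFN
    return _norm(h + ffn)                             # residual + norm
```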
We evaluate Graph Transformer on benchmark datasets and validate our design choices, alongside an attempt to answer some open research questions on:
We note the following results:
Among the several combinations of design choices on attention, use of PEs, normalization candidates, etc., the architecture using sparse graph attention, Laplacian PEs, and batch normalization has the best performance across all datasets used for evaluation. This empirically validates the choice of using these components for the targeted generalization of transformers.
Since Laplacian PEs have desirable properties, such as capturing the graph structure through the spectrum of the graph Laplacian and generalizing the sinusoidal PEs of the original Transformer, they fare as a suitable PE candidate for Graph Transformer, even empirically, as compared to the PE candidates used in the literature on Graph Transformers (the related works are discussed in detail in the paper).
The tables of the numerical experiments are in the paper, the code implementation is open-sourced on GitHub, and an accompanying video presentation is on YouTube, with the corresponding links as follows:
Paper: https://arxiv.org/abs/2012.09699
GitHub: https://github.com/graphdeeplearning/graphtransformer
Video: https://youtu.be/h-_HNeBmaaU?t=240
Vijay Prakash Dwivedi. Graph Transformer: A Generalization of Transformers to Graphs.