
Vijay Prakash Dwivedi | Graph Transformer: A Generalization of Transformers to Graphs. #54


NorbertZheng commented 1 year ago

Vijay Prakash Dwivedi. Graph Transformer: A Generalization of Transformers to Graphs.

NorbertZheng commented 1 year ago

Overview

This blog is based on the paper A Generalization of Transformer Networks to Graphs, with Xavier Bresson, presented at the 2021 AAAI Workshop on Deep Learning on Graphs: Methods and Applications (DLG-AAAI’21). We present Graph Transformer, a transformer neural network that can operate on arbitrary graphs.

NorbertZheng commented 1 year ago

Background

Let's start with the two keywords, Transformers and Graphs, for background.

Transformers

Transformer-based neural networks [1] are the most successful architectures for representation learning in Natural Language Processing (NLP).

At the core of Transformers lies the multi-head attention mechanism, represented by the following feature update formula. Fig: The feature update equation for word $i$ using the multi-head attention mechanism; the symbols carry their usual meanings.
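Written out in the usual notation (a sketch that may differ in minor details from the figure), the per-word update with $H$ attention heads is:

$$
h_i^{\ell+1} = O \, \Big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{S}} w_{ij}^{k} \, V^{k} h_j^{\ell} \Big),
\qquad
w_{ij}^{k} = \mathrm{softmax}_j \!\left( \frac{Q^{k} h_i^{\ell} \cdot K^{k} h_j^{\ell}}{\sqrt{d_k}} \right),
$$

where $\mathcal{S}$ is the set of words in the sentence, $Q^{k}, K^{k}, V^{k}$ are the query, key, and value projections of head $k$, $\Vert$ denotes concatenation over the $H$ heads, and $O$ is the output projection.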

Using multi-head attention, a word attends to every other word in the sentence and combines the received information to generate its abstract feature representation.

Graphs

Graphs are ubiquitous data structures. There is a wide range of application domains where datasets can be represented as graphs: for example, molecular graphs in chemistry, interactions among particles in physics, drug-protein interactions in medicine, users and their social and commercial connections in social media, problems in combinatorial optimization, etc.

Fig: Representative domains with graph-structured datasets.

For learning on graphs, graph neural networks (GNNs) have emerged as the most powerful tool in deep learning.

In short, GNNs consist of several parameterized layers, with each layer taking in a graph with node (and edge) features and building abstract feature representations of the nodes (and edges).

The features so generated are then passed to downstream classification layers, usually MLPs, to predict the target property.
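As a rough sketch (not the formulation of any particular GNN from the literature), one such parameterized layer with mean aggregation over neighbors might look like this in PyTorch, assuming a dense adjacency matrix `adj`:

```python
import torch
import torch.nn as nn


class SimpleGNNLayer(nn.Module):
    """One parameterized message-passing layer: aggregate neighbors, then transform."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h:   (num_nodes, in_dim) node features
        # adj: (num_nodes, num_nodes) adjacency matrix (1.0 if edge, else 0.0)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
        neigh = (adj @ h) / deg                          # mean over neighbor features
        return torch.relu(self.linear(torch.cat([h, neigh], dim=-1)))
```

Stacking a few such layers and feeding the resulting node features into an MLP head is the usual recipe for predicting the target property.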

NorbertZheng commented 1 year ago

Objective

Now, the objective is to generalize transformer neural networks to graphs so that they can learn from graph-structured datasets.

To proceed with the objective, we focus on extending the key design principles of Transformers from NLP to graphs in general.

NorbertZheng commented 1 year ago

Key Design Aspects for Graph Transformer

We find that attention restricted by graph sparsity and positional encodings are the two key design aspects for generalizing transformers to arbitrary graphs.

We now discuss these in the contexts of both NLP and graphs to make the proposed extensions in Graph Transformer clear.

Graph Sparsity

In NLP, Transformers consider full attention while building feature representations for words. That is, a transformer treats a sentence as a fully connected graph of words. This choice of full attention can be justified for two reasons:

First, it is difficult to find meaningful sparse interactions or connections among the words in a sentence. For instance, the dependency of a word on another word can vary with the context, the user's perspective, and the application at hand. There can be numerous plausible ground-truth connections among the words in a sentence, and therefore text datasets of sentences often do not have explicit word interactions available. It thereby makes sense to perform full attention and let the model decide how the words depend on one another.

Second, the fully connected graph so interpreted in NLP usually has only tens or hundreds of nodes (i.e., sentences are often tens to hundreds of words long). On graphs of this size, attending to every node is computationally feasible in both memory and time.

Fig: Full attention vs. attention restricted to the local neighborhood.

However, in the case of actual graph datasets, graphs have arbitrary connectivity structure determined by the application domain, and can have node counts ranging up to millions or even billions. The available structure is a rich source of information to exploit while learning, whereas the sheer number of nodes makes it practically impossible to treat such datasets as fully connected graphs.

In fact, local information aggregation is at the core of GNNs, which indicates that sparsity is a good inductive bias for generalization.
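As a minimal single-head sketch of the contrast, assuming dense tensors and hypothetical projection modules `q_proj`, `k_proj`, `v_proj` (the authors' actual implementation operates on sparse edge lists instead): full attention scores every pair of nodes, while graph attention masks the scores with the adjacency matrix so each node attends only to its neighbors.

```python
import torch


def attention(h, q_proj, k_proj, v_proj, adj=None):
    # h: (num_nodes, dim) node features; q_proj/k_proj/v_proj: nn.Linear projections.
    q, k, v = q_proj(h), k_proj(h), v_proj(h)
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)            # O(N^2) pairwise scores
    if adj is not None:
        # Graph attention: keep only the scores along edges of the graph
        # (adj is assumed to include self-loops, so no row is entirely -inf).
        scores = scores.masked_fill(adj == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Note that even the masked variant above still materializes the quadratic score matrix; a practical sparse implementation computes scores only along existing edges, which is what makes attention on very large graphs tractable.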

Positional Encodings

The attention mechanism in the Transformer is invariant to the ordering of its inputs: it has no notion of where in the sequence (or the sentence) a word is located. This means that the Transformer regards its input as a multi-set of words rather than a sequence of words, as illustrated by the following comparison:

Fig: For the multi-head attention mechanism, the input can be interpreted as a multi-set of words.

That would mean losing some information about the ordering of the words, wouldn't it?
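As a quick sanity check (a toy numpy example, not from the paper), permuting the input rows of plain dot-product attention simply permutes the output rows in the same way, i.e., the mechanism itself retains no ordering information:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                              # 5 "words", 8-dim features
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))


def attend(X):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(8)
    scores -= scores.max(axis=-1, keepdims=True)         # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ (X @ Wv)


perm = rng.permutation(5)
# The output for a permuted input is just the permuted output of the original input.
assert np.allclose(attend(X[perm]), attend(X)[perm])
```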

To avoid this and make the Transformer aware of the sequential information, some kind of positional encoding is necessary. The original Transformer by Vaswani et al. [1] uses a sinusoidal positional encoding that is added to each word's feature vector at the input. This helps encode the prevalent (sequential) relationships among the words into the model.
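For reference, the sinusoidal encoding of [1] for position $pos$ and dimension index $i$ (out of $d$ model dimensions) is

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right),
$$

and this vector is simply added to the embedding of the word at position $pos$.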

We extend this critical design block of positional information encoding to Graph Transformer. In fact, a line of research in GNNs [6,7,8] has recently shown that injecting positional information of the nodes improves the expressive power and performance of GNNs.

We therefore leverage the success of these recent works on positional information in GNNs and use Laplacian Positional Encodings [8] in Graph Transformer. Precomputed Laplacian eigenvectors [9] are added to the node features before the first layer, similar to how positional encodings are added in the original Transformer [1].
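A minimal sketch of how such encodings can be precomputed with plain numpy (the function name and the choice of `k` are ours for illustration; the released code may differ in details such as random sign flipping of the eigenvectors during training):

```python
import numpy as np


def laplacian_positional_encoding(adj, k):
    """Return the k smallest non-trivial eigenvectors of the normalized Laplacian."""
    deg = adj.sum(axis=1)                                # assumes no isolated nodes
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adj)) - adj / np.sqrt(np.outer(deg, deg))
    _, eigvecs = np.linalg.eigh(lap)                     # sorted by ascending eigenvalue
    # Drop the first eigenvector (eigenvalue 0); use the next k as positional encodings.
    return eigvecs[:, 1 : k + 1]
```

These $k$-dimensional vectors are then (after a linear projection to the model dimension) added to the node features before the first layer, playing the same role as the sinusoidal encodings in [1].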

Fig: Positional Encodings (PEs) in the original Transformer vs. the proposed Graph Transformer.

Hence, sparse graph structure during attention and positional encodings at the inputs are the two key ingredients we consider while generalizing transformers to arbitrary graphs.

NorbertZheng commented 1 year ago

Might James want to utilize the learned representations to approximate Laplacian PEs?

NorbertZheng commented 1 year ago

Proposed Graph Transformer Architecture

We now present the proposed architecture: the Graph Transformer Layer and the Graph Transformer Layer with edge features. The schematic diagram of a layer, with its major components, is shown below.

Fig: Schematic diagrams of the Graph Transformer Layer and the Graph Transformer Layer with edge features.

Compared to the standard transformer of Vaswani et al. [1], the key differences (or extensions) that generalize the transformer to graphs and yield a Graph Transformer are:

- The attention mechanism is a function of the neighborhood connectivity of each node, i.e., attention is restricted to the local neighborhood rather than being full.
- The positional encoding is represented by Laplacian eigenvectors, which naturally generalize the sinusoidal positional encodings used in NLP.
- Layer normalization [11] is replaced by batch normalization [10], which gives faster training and better generalization.
- The architecture is extended to edge feature representations, which can be critical for tasks with rich information on the edges, such as bond types in chemistry or entity relationships in knowledge graphs.

This concludes the description of the proposed Graph Transformer. We refer to the paper for the complete layer update equations.
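As a very rough single-head sketch of how these pieces fit together (dense adjacency, no edge-feature pathway; the class name and layer sizes are made up for illustration and do not reproduce the authors' DGL implementation):

```python
import torch
import torch.nn as nn


class GraphTransformerLayerSketch(nn.Module):
    """Neighborhood-masked attention + residuals + batch norm + feed-forward block."""

    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(dim, dim) for _ in range(4))
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.bn1, self.bn2 = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)

    def forward(self, h, adj):
        # h: (num_nodes, dim) node features; adj: adjacency matrix with self-loops.
        scores = (self.q(h) @ self.k(h).T) / (h.shape[-1] ** 0.5)
        scores = scores.masked_fill(adj == 0, float("-inf"))   # attend to neighbors only
        attn_out = self.o(torch.softmax(scores, dim=-1) @ self.v(h))
        h = self.bn1(h + attn_out)                             # residual + batch norm
        return self.bn2(h + self.ffn(h))                       # residual + batch norm
```

Laplacian positional encodings (as sketched earlier) would be added to `h` before the first such layer, and the edge-feature variant additionally maintains and updates edge representations that modulate the attention scores.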

NorbertZheng commented 1 year ago

Remarks from the experiments

We evaluate Graph Transformer on benchmark datasets and validate our design choices, alongside an attempt to answer some open research questions: how graph sparsity should be used in attention, which positional encodings help, and which normalization scheme (batch vs. layer) to prefer.

We remark the following results:

- Attention restricted to the sparse graph structure performs better than full attention on the benchmark datasets.
- Laplacian positional encodings improve performance compared to using no positional encoding.
- Batch normalization yields better and more stable results than layer normalization.

NorbertZheng commented 1 year ago

Code

The tables of the numerical experiments are in the paper, the code implementation is open-sourced on GitHub, and an accompanying video presentation is on YouTube, with the corresponding links as follows:

- Paper: https://arxiv.org/abs/2012.09699
- GitHub: https://github.com/graphdeeplearning/graphtransformer
- Video: https://youtu.be/h-_HNeBmaaU?t=240

NorbertZheng commented 1 year ago

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need.
  2. Devlin, J., Chang, M.W., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
  3. Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018). Improving language understanding by generative pre-training.
  4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S. (2020). Language models are few-shot learners.
  5. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P. and Bengio, Y. (2018). Graph attention networks.
  6. Srinivasan, B. and Ribeiro, B. (2019). On the equivalence between positional node embeddings and structural graph representations.
  7. You, J., Ying, R. and Leskovec, J. (2019). Position-aware graph neural networks.
  8. Dwivedi, V.P., Joshi, C.K., Laurent, T., Bengio, Y. and Bresson, X. (2020). Benchmarking graph neural networks.
  9. Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation.
  10. Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift.
  11. Ba, J.L., Kiros, J.R. and Hinton, G.E. (2016). Layer normalization.
  12. Kipf, T.N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks.