NorbertZheng / read-papers

My paper reading notes.

Stefano Bosisio | Graph Neural Networks: A learning journey since 2008 -- Graph Attention Networks. #49

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Stefano Bosisio. Graph Neural Networks: A learning journey since 2008 — Graph Attention Networks.

NorbertZheng commented 1 year ago

Overview

My previous posts about graphs and M

NorbertZheng commented 1 year ago

Welcome back to my series on Graph Neural Networks. Today I am going to introduce you to the basic theory underlying one of the greatest Graph Neural Network frameworks: the Graph Attention Network (GAT) [1], published in 2018 by Veličković, Cucurull, Casanova, Romero, Liò, and Bengio (you should get goosebumps just hearing those names!). What's new in this ML-graph approach? Let's review what we saw in the previous episodes.

Initially, there were Graph Neural Networks (GNNs) by Gori and Scarselli [5–9], where recursive neural networks were generalized to interact directly with graph objects. GNNs are based on the Banach fixed-point theorem: once equilibrium is reached, a neural network is run to return an output.

Another approach we saw is DeepWalk [6] by Perozzi in 2014 and its evolution in 2017 [7]. DeepWalk introduced the concept of node embeddings, namely a latent representation (or social representation) of nodes obtained with SkipGram [8–12]. This model works in an unsupervised mode, although extensibility and representation learning may still be problematic.

Thus, further Graph Neural Network frameworks have been proposed. In spectral approaches [13–17], the graph Laplacian representation, eigenvalue decomposition, and convolution are the key ingredients. Convolution can be further exploited through the concept of diffusion. In this approach [18–22], the algorithm learns a specific weight matrix for each node, computing powers of a transition matrix to describe a node's neighborhood.

Thus, once we have trained our model on a specific graph, we cannot use that model to run predictions on a graph with a different structure.

In GAT the authors emphasize the power of attention [23–28]. The attention algorithm comes directly from the neural machine translation, or sequence-to-sequence (Seq2Seq), problem. If I am translating from French to English, I have to consider the difference in length between the input and output sequences, as well as what the best term to use is, given a vocabulary and a target sequence. Attention not only fixes the problem of different lengths, but also detects the most relevant parts of the input, to maximize the output probability given the input sequence.

Let's now jump into how attention works and how it all started.

NorbertZheng commented 1 year ago

Attention: what? who? where? when? why?

The attention algorithm has its origins in Cho's RNN Encoder-Decoder model [23]. The authors were looking for an efficient way to improve ML models for seq2seq translation.

The encoder encodes a sequence of symbols into a fixed-length vector representation, while the decoder decodes that representation into another sequence of symbols.

Fig.1 shows how Cho’s algorithm works. Given an input sequence $x$ (e.g. Je m'appelle S) the model tries to maximize the conditional log-likelihood for a target sequence $y$:

$$ \max_{\theta}\frac{1}{N}\sum_{n=1}^{N}\log p_{\theta}(y_{n}|x_{n}). $$

The input sequence is encoded with an embedding layer of size hidden_size and then processed by a GRU unit. The hidden state of the GRU, the context_vector, is given to the decoder together with the target sequence. The decoder projects the target sequence to an embedding space, along with the context vector. Finally, a GRU and a linear layer create a mapping between $x$ and $y$.

Fig.1: Logical scheme of Cho's RNN encoder-decoder model. An input sequence is digested by the encoder. The encoder's hidden state is used as input for the decoder, along with the target sequence, in order to find the right mapping from sequence to sequence.

NorbertZheng commented 1 year ago

Here is example code showing how to implement Cho's encoder and decoder in PyTorch. One key difference is that the Decoder class maps the input into the embedded target space.

import torch
import torch.nn as nn
import torch.nn.functional as F

# device used by initHidden below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Encoder
class EncoderRNN(nn.Module):
    r"""Class for the Encoder in Cho's RNN model."""
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
# Decoder
class DecoderRNN(nn.Module):
    r""" Decoder class"""
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

Example code for the Encoder and Decoder of Cho's RNN model. Both models rely on an embedding layer and a GRU.
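
As a quick usage sketch (the vocabulary sizes, hidden size, and token ids below are made-up toy values, not from the original post), the encoder is run token by token over the source sequence and its final hidden state, i.e. the context vector, initialises the decoder:

# Hypothetical usage sketch for the classes above (toy sizes and token ids).
input_size, output_size, hidden_size = 10, 12, 16

encoder = EncoderRNN(input_size, hidden_size).to(device)
decoder = DecoderRNN(hidden_size, output_size).to(device)

input_seq = torch.tensor([1, 4, 7], device=device)      # toy source token ids
target_seq = torch.tensor([2, 5, 8, 3], device=device)  # toy target token ids

encoder_hidden = encoder.initHidden()
for token in input_seq:                   # encode the source one token at a time
    _, encoder_hidden = encoder(token, encoder_hidden)

decoder_hidden = encoder_hidden           # the context vector initialises the decoder
for token in target_seq:                  # teacher forcing with the target tokens
    log_probs, decoder_hidden = decoder(token, decoder_hidden)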

NorbertZheng commented 1 year ago

Further visual help comes from Keras, where we can code this model up in a simple way to show how the input information is processed:

from keras.models import Model
from keras.layers import Input, GRU, Dense

# toy sizes, for illustration only
latent_dim = 256           # dimensionality of the GRU hidden state
num_encoder_tokens = 1000  # number of tokens we have from input language X

# ENCODER
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = GRU(latent_dim, return_state=True)
encoder_outputs, state_h = encoder(encoder_inputs)

# DECODER
num_decoder_tokens = 1000  # the output language Y tokens (toy value)
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_gru = GRU(latent_dim, return_sequences=True)
# here the initial state is the hidden state from the encoder 
decoder_outputs = decoder_gru(decoder_inputs, initial_state=state_h)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that maps [encoder_inputs, decoder_inputs] to decoder_outputs
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

Keras implementation of Cho's RNN encoder-decoder. Here you can further see how the information flows from the encoder to the decoder: the encoder GRU's hidden state is passed to the decoder as its initial state.

NorbertZheng commented 1 year ago

Later, Bahdanau et al. extended Cho's model [26, 28], structuring the attention algorithm with an alignment score. The alignment score further enhances the encoder's output, defining how well the encoder output at position $j$ matches the decoder output at position $i$:

$$ e_{ij}=a(s_{i},h_{j}), $$

$$ e_{11}=a(0, h_{1}) \quad e_{12}=a(0,h_{2}) \quad e_{13}=a(0,h_{3}), $$

$$ \alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{ik})}, $$

$$ c_{i}=\sum_{j=1}^{T}\alpha_{ij}h_{j}. $$
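
To make these formulas concrete, here is a minimal, self-contained PyTorch sketch (the hidden size, the source length and the tiny feed-forward scorer $a$ are illustrative assumptions, not Bahdanau's exact parametrisation) that computes the alignment scores $e_{ij}$, the attention weights $\alpha_{ij}$ and the context vector $c_{i}$ for a single decoder step:

import torch
import torch.nn as nn

hidden_size, T = 8, 3                      # toy sizes: hidden dimension and source length
h = torch.randn(T, hidden_size)            # encoder hidden states h_1 ... h_T
s_prev = torch.zeros(hidden_size)          # decoder state used for scoring (0 at the first step)

# a(.,.): a small feed-forward scorer over the concatenated states (illustrative choice)
a = nn.Sequential(nn.Linear(2 * hidden_size, hidden_size), nn.Tanh(), nn.Linear(hidden_size, 1))

# e_ij = a(s_i, h_j) for every encoder position j
e = a(torch.cat([s_prev.expand(T, -1), h], dim=1)).squeeze(-1)   # shape (T,)
# alpha_ij: softmax over the encoder positions
alpha = torch.softmax(e, dim=0)                                  # shape (T,), sums to 1
# c_i = sum_j alpha_ij * h_j
c = (alpha.unsqueeze(-1) * h).sum(dim=0)                         # context vector, shape (hidden_size,)

The attention weights alpha tell the decoder which encoder positions matter most for the current output step, and the context vector c is the corresponding weighted summary of the encoder states.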

NorbertZheng commented 1 year ago

This scheme can easily be applied to Cho's model in PyTorch by adding an attention layer, made up of a feed-forward neural network, a softmax, and a matrix multiplication between the attention weights/alignment scores and the encoder outputs:

# Encoder (imports and device as in the previous listing)
class EncoderRNN(nn.Module):
    r"""Class for the Encoder in Cho's RNN model."""
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
# Decoder
class DecoderRNN(nn.Module):
    r""" Decoder class"""
    def __init__(self, hidden_size, output_size, max_length):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.max_length = max_length

        self.embedding = nn.Embedding(output_size, hidden_size)
        # attention layers
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)

        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        # apply attention
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

NorbertZheng commented 1 year ago

To summarise, Bahdanau's approach passes all of the encoder's hidden states to the RNN decoder and adds a further step, namely the attention/alignment score layer, to the decoder architecture. From here we can start diving into the Graph Attention Network algorithm.

NorbertZheng commented 1 year ago

Graph Attention Network (GAT)

Starting from Bahdanau's attention approach, Veličković et al. kicked off the graph attention network framework; I must underline, however, that GAT is agnostic to the particular choice of attention mechanism. Now, let's consider a graph of $N$ nodes, where each node has $F$ features. The overall process of the attention mechanism here is to ingest a set of node features $h$ (of dimension $F$) and to return an output representation of these features, $h'$ (of dimension $F'$). The output representation captures latent features, which allows the network to run predictions at the node level.

image: scheme of actions for GAT.

The above figure reports the scheme of actions for GAT. Nodes share a weight matrix $W$ of dimensions $F\times F'$. The weight matrix $W$ multiplies every node's features to perform a first linear transformation. A second transformation follows through the self-attention mechanism: a feed-forward neural network $a$ is applied to return the attention coefficients $e$:

$$ e_{ij}=a(Wh_{i},Wh_{j}). $$

The attention coefficients output by $a$ pass through a LeakyReLU before going to a softmax layer. The softmax normalizes the coefficients over the neighborhood of node $i$, to make all the coefficients easily comparable across different nodes:

$$ \alpha_{ij}=\frac{\exp(\mathrm{LeakyReLU}(e_{ij}))}{\sum_{k\in N_{i}}\exp(\mathrm{LeakyReLU}(e_{ik}))}. $$

Finally, the transformed features of the neighbors are aggregated with the attention coefficients and a nonlinearity $\sigma$ is applied, to obtain the final feature description $h'_{i}=\sigma\left(\sum_{j\in N_{i}}\alpha_{ij}Wh_{j}\right)$. To further stabilize the learning process of the self-attention mechanism, the authors extended the algorithm to use multi-head attention. This approach consists of running multiple attention layers in parallel, in order to grasp some long-range dependencies.

Long-range dependencies are relationships between nodes that are far apart from each other. The feature sets from the different heads can be concatenated or, as proposed in the GAT paper for the final layer, averaged to return the final node predictions. This averaging procedure highlights that the output of GAT is information aggregated from the neighborhood of the target nodes.
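
Before looking at a full implementation, here is a minimal PyTorch sketch of the attention computation for a single node $i$ and its neighbourhood (the feature dimensions, the random tensors and the single-vector parametrisation of $a$ are illustrative assumptions; the LeakyReLU slope of 0.2 and the ELU nonlinearity follow the GAT paper):

import torch
import torch.nn.functional as F

F_in, F_out = 4, 3                           # toy sizes: F input features, F' output features
W = torch.randn(F_in, F_out)                 # shared weight matrix W (F x F')
a = torch.randn(2 * F_out)                   # attention vector a applied to [Wh_i || Wh_j]

h_i = torch.randn(F_in)                      # features of node i
h_neigh = torch.randn(5, F_in)               # features of the 5 neighbours of i (including i itself)

Wh_i = h_i @ W                               # first linear transformation
Wh_j = h_neigh @ W

# e_ij = a(Wh_i, Wh_j): score every neighbour against node i
e = torch.cat([Wh_i.expand(5, -1), Wh_j], dim=1) @ a
# alpha_ij: LeakyReLU followed by a softmax over the neighbourhood of i
alpha = torch.softmax(F.leaky_relu(e, negative_slope=0.2), dim=0)
# h'_i = sigma(sum_j alpha_ij * W h_j), with sigma = ELU
h_prime_i = F.elu((alpha.unsqueeze(-1) * Wh_j).sum(dim=0))       # shape (F',)

Multi-head attention simply repeats this computation with independent $W$ and $a$ per head and concatenates (or averages) the resulting h_prime_i vectors.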

NorbertZheng commented 1 year ago

GAT & TensorFlow/Keras

The following code shows a simple TensorFlow/Keras implementation of the core part of GAT. In the call function, GAT receives as inputs the nodes' features and the edges (inputs), which carry the graph's structural information. The input node states/features are initially multiplied by the weight matrix $W$ to perform a first linear transformation: node_states_transformed = tf.matmul(node_states, self.kernel). The transformed states are gathered together with the structural information, node_states_expanded, and from there a further linear transformation is run to obtain the attention coefficients: attention_scores = tf.nn.leaky_relu(tf.matmul(node_states_expanded, self.kernel_attention)).

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

class GraphAttention(layers.Layer):
    r""" This is the main class, which implements the attention 
    mechanism"""

    def __init__(self, units, kernel_initializer="glorot_uniform", kernel_regularizer=None, **kwargs):
        r""" Constructor 

        Parameters
        ----------
        units: int, number of attention units,
        kernel_initializer: str, initializer for the Keras kernels
        kernel_regularizer: str, regularizer (optional)
        """
        super().__init__(**kwargs)
        self.units = units
        self.kernel_initializer = keras.initializers.get(kernel_initializer)
        self.kernel_regularizer = keras.regularizers.get(kernel_regularizer)

    def build(self, input_shape):
        r""" Build weights for attention layer

        Parameters
        ---------
        input_shape: int, input shape for weights dimension
        """

        self.kernel = self.add_weight(shape=(input_shape[0][-1], self.units),
                                      trainable=True,
                                      initializer=self.kernel_initializer,
                                      regularizer=self.kernel_regularizer,
                                      name="kernel",
                                     )
        self.kernel_attention = self.add_weight(shape=(self.units * 2, 1),
                                                trainable=True,
                                                initializer=self.kernel_initializer,
                                                regularizer=self.kernel_regularizer,
                                                name="kernel_attention",
                                               )
        self.built = True

    def call(self, inputs):
        r""" Main function to run attention

        Parameters
        ----------
        inputs: tuple, node features and edges
        """
        node_states, edges = inputs

        # W*h
        node_states_transformed = tf.matmul(node_states, self.kernel)
        # attention score
        # concatenate features with structural information
        node_states_expanded = tf.gather(node_states_transformed, edges)
        node_states_expanded = tf.reshape(node_states_expanded, (tf.shape(edges)[0], -1) )
        # compute the attention score from leaky_relu 
        attention_scores = tf.nn.leaky_relu(
            tf.matmul(node_states_expanded, self.kernel_attention)
        )
        attention_scores = tf.squeeze(attention_scores, -1)

        # softmax on attention scores
        attention_scores = tf.math.exp(tf.clip_by_value(attention_scores, -2, 2))
        attention_scores_sum = tf.math.unsorted_segment_sum(
            data=attention_scores,
            segment_ids=edges[:, 0],
            num_segments=tf.reduce_max(edges[:, 0]) + 1,
        )
        attention_scores_sum = tf.repeat(
            attention_scores_sum, tf.math.bincount(tf.cast(edges[:, 0], "int32"))
        )
        attention_scores_norm = attention_scores / attention_scores_sum

        # finally aggregate all the attention scores
        node_states_neighbors = tf.gather(node_states_transformed, edges[:, 1])
        out = tf.math.unsorted_segment_sum(
            data=node_states_neighbors * attention_scores_norm[:, tf.newaxis],
            segment_ids=edges[:, 0],
            num_segments=tf.shape(node_states)[0],
        )
        return out

Finally, the softmax is applied to attention_scores (the clipped exponential followed by the segment-wise sum in call) and the neighbourhood feature information from the attention layer is returned to the caller.
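
The softmax here is computed with segment operations rather than tf.nn.softmax, because each edge's score must be normalised only over the edges that share the same target node. Here is a tiny standalone sketch of this trick (toy edges and scores; I use tf.gather to broadcast the per-node sums back to the edges, which plays the same role as the tf.repeat/bincount combination in the listing when the edges are sorted by target node):

import tensorflow as tf

# 4 edges; column 0 is the target node of each edge: nodes 0 and 1 have two edges each
edges = tf.constant([[0, 1], [0, 2], [1, 0], [1, 2]])
scores = tf.constant([1.0, 2.0, 0.5, 1.5])

exp_scores = tf.math.exp(scores)
# sum of the exponentiated scores per target node
seg_sum = tf.math.unsorted_segment_sum(exp_scores, edges[:, 0], num_segments=2)
# softmax within each node's edge set: the first two and last two entries each sum to 1
norm_scores = exp_scores / tf.gather(seg_sum, edges[:, 0])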

The second most important bit is the multi-head attention. The class defines the number of attention layers to be run in parallel, along with each attention layer's units. The heads' outputs are either concatenated or averaged to produce the final result: tf.reduce_mean(tf.stack(outputs, axis=-1), axis=-1).

class MultiHeadGraphAttention(layers.Layer):
    r""" Class for the multi-head attention """
    def __init__(self, units, num_heads=8, merge_type="concat", **kwargs):
        r""" Constructor, define the number of attention heads - namely how many
        attention process to run in parallel - the way to merge final results

        Parameters
        ----------
        units: int, number of units for the single attention
        num_heads: int, number of total attention layers to run in parallel
        merge_type: str, "concat" or anything else. "concat" concatenates the outputs
                    of all heads, while any other value averages them
        """
        super().__init__(**kwargs)
        self.num_heads = num_heads
        self.merge_type = merge_type
        self.attention_layers = [GraphAttention(units) for _ in range(num_heads)]

    def call(self, inputs):
        r""" Main calling function to run against input data 

        Parameters
        ----------
        inputs: tuple, nodes' features + nodes' edges
        """
        atom_features, pair_indices = inputs
        # Obtain outputs from each attention head
        outputs = [
            attention_layer([atom_features, pair_indices])
            for attention_layer in self.attention_layers
        ]
        # Concatenate or average the node states from each head
        if self.merge_type == "concat":
            outputs = tf.concat(outputs, axis=-1)
        else:
            outputs = tf.reduce_mean(tf.stack(outputs, axis=-1), axis=-1)
        # Activate and return node states
        return tf.nn.relu(outputs)

Let’s finish this story, with the final implementation of GAT, to have the full clear picture of the algorithm:

class GraphAttentionNetwork(keras.Model):
    r""" Define the class for GAT"""
    def __init__(self, node_states, edges, hidden_units, num_heads, num_layers, output_dim, **kwargs,):
        r""" Constructor for GAT
        Parameters
        ----------
        node_states: array, nodes' features
        edges: array, structural info about nodes 
        hidden_units: int, number of hidden units to be used for preprocessing
        num_heads: int, number of heads for attention 
        num_layers: int, number of layers for attention
        output_dim: int, output dimension
        """"
        super().__init__(**kwargs)
        self.node_states = node_states
        self.edges = edges
        # preprocess
        self.preprocess = layers.Dense(hidden_units * num_heads, activation="relu")
        # define a num_layers of multi-head attention
        self.attention_layers = [
            MultiHeadGraphAttention(hidden_units, num_heads) for _ in range(num_layers)
        ]
        # final predictions
        self.output_layer = layers.Dense(output_dim)

    def call(self, inputs):
        r""" Main running function against input data 
        Parameters
        ----------
        inputs: tuple, nodes' features and edges
        """
        # read the input data and preprocess
        node_states, edges = inputs
        x = self.preprocess(node_states)
        # run the cycle over the attention layers
        for attention_layer in self.attention_layers:
            x = attention_layer([x, edges]) + x
        # produce the output
        outputs = self.output_layer(x)
        return outputs

    def train_step(self, data):
        r""" Train steps given the input data
        Parameters
        -----------
        data: tuple, nodes' indices and labels
        """
        indices, labels = data

        with tf.GradientTape() as tape:
            # run the call function
            outputs = self([self.node_states, self.edges])
            # compute the loss
            loss = self.compiled_loss(labels, tf.gather(outputs, indices))
        # backprop with gradients
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        # metrics check
        self.compiled_metrics.update_state(labels, tf.gather(outputs, indices))

        return {m.name: m.result() for m in self.metrics}
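
As a hypothetical usage sketch (random node features, edges and labels are stand-ins for a real dataset, and the hyper-parameters are arbitrary), the model above can be compiled and trained with the standard Keras workflow, passing node indices and labels to fit while the graph itself stays stored on the model:

import numpy as np
import tensorflow as tf
from tensorflow import keras

num_nodes, num_features, num_classes = 100, 16, 4
node_states = tf.random.normal((num_nodes, num_features))
edges = np.random.randint(0, num_nodes, size=(500, 2))
edges = tf.constant(edges[np.argsort(edges[:, 0])])   # sort by target node so the repeat/bincount normalisation lines up
train_indices = np.arange(80)
train_labels = np.random.randint(0, num_classes, size=80)

gat = GraphAttentionNetwork(node_states, edges, hidden_units=8,
                            num_heads=4, num_layers=2, output_dim=num_classes)
gat.compile(loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer=keras.optimizers.Adam(1e-2),
            metrics=[keras.metrics.SparseCategoricalAccuracy()])
gat.fit(x=train_indices, y=train_labels, batch_size=80, epochs=5)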

This is enough for today. Join me in the next story, where we will apply GAT to real datasets and see what the real power of GAT is :)

NorbertZheng commented 1 year ago

BIBLIOGRAPHY

  1. Veličković, Petar, et al. “Graph attention networks.” arXiv preprint arXiv:1710.10903 (2017).
  2. Scarselli, Franco, et al. “The graph neural network model.” IEEE transactions on neural networks 20.1 (2008): 61–80.
  3. Scarselli, Franco, et al. “Computational capabilities of graph neural networks.” IEEE Transactions on Neural Networks 20.1 (2008): 81–102.
  4. Scarselli, Franco, et al. “Graph neural networks for ranking web pages.” The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05). IEEE, 2005.
  5. Gori, Marco, Gabriele Monfardini, and Franco Scarselli. “A new model for learning in graph domains.” Proceedings. 2005 IEEE international joint conference on neural networks. Vol. 2. №2005. 2005.
  6. Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “Deepwalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
  7. Perozzi, Bryan, et al. “Don’t Walk, Skip! Online learning of multi-scale network embeddings.” Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. 2017.
  8. Mikolov, Tomas, et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
  9. Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.
  10. Guthrie, David, et al. “A closer look at skip-gram modelling.” LREC. Vol. 6. 2006.
  11. Gittens, Alex, Dimitris Achlioptas, and Michael W. Mahoney. “Skip-Gram− Zipf+ Uniform= Vector Additivity.” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.
  12. Mimno, David, and Laure Thompson. “The strange geometry of skip-gram with negative sampling.” Empirical Methods in Natural Language Processing. 2017.
  13. Kipf, Thomas N., and Max Welling. “Semi-supervised classification with graph convolutional networks.” arXiv preprint arXiv:1609.02907 (2016).
  14. Estrach, Joan Bruna, et al. “Spectral networks and deep locally connected networks on graphs.” 2nd International Conference on Learning Representations, ICLR. Vol. 2014. 2014.
  15. Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst. “Convolutional neural networks on graphs with fast localized spectral filtering.” Advances in neural information processing systems 29 (2016): 3844–3852.
  16. Duvenaud, David, et al. “Convolutional networks on graphs for learning molecular fingerprints.” arXiv preprint arXiv:1509.09292 (2015).
  17. Hammond, David K., Pierre Vandergheynst, and Rémi Gribonval. “Wavelets on graphs via spectral graph theory.” Applied and Computational Harmonic Analysis 30.2 (2011): 129–150.
  18. Atwood, James, and Don Towsley. “Diffusion-convolutional neural networks.” Advances in neural information processing systems. 2016.
  19. Atwood, James, et al. “Sparse Diffusion-Convolutional Neural Networks.” arXiv preprint arXiv:1710.09813 (2017).
  20. Ye, Nong. “A markov chain model of temporal behavior for anomaly detection.” Proceedings of the 2000 IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop. Vol. 166. West Point, NY, 2000.
  21. Hayes, Brian. “First links in the Markov chain.” American Scientist 101.2 (2013): 252.
  22. Orey, Steven. Limit theorems for Markov chain transition probabilities. London: van Nostrand, 1971.
  23. Cho, Kyunghyun, et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” arXiv preprint arXiv:1406.1078 (2014).
  24. Chorowski, Jan K., et al. “Attention-based models for speech recognition.” Advances in neural information processing systems 28 (2015).
  25. Bahdanau, Dzmitry, et al. “End-to-end attention-based large vocabulary speech recognition.” 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016.
  26. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
  27. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
  28. Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. “Effective approaches to attention-based neural machine translation.” arXiv preprint arXiv:1508.04025 (2015).