Closed monk1337 closed 4 years ago
Looking at your GCN, it looks like the first layer has a hidden size of 1024 units, whereas for GAT you have set hid_units to [8]. This means that the first layer has 8 heads with a hidden size of 8 each, which is much smaller than the hidden size used in the GCN.
You can try changing hid_units to [128] or [256], which will increase the hidden size of each head in the first layer, increasing the capacity of the model.
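A quick back-of-the-envelope sketch of the point above (illustrative only, not part of the GAT code): a multi-head GAT layer concatenates its heads, so its effective output width is the number of heads times the per-head hidden size.

```python
# Effective layer width: heads * per-head units (heads are concatenated).
gcn_hidden = 1024            # hidden size of the GCN layer in question
gat_heads, gat_units = 8, 8  # n_heads[0]=8 with hid_units=[8]

gat_width = gat_heads * gat_units
print(gat_width, gcn_hidden)  # 64 vs 1024 — much narrower than the GCN

# Raising hid_units to [128] gives 8 * 128 = 1024, matching the GCN width.
assert gat_heads * 128 == gcn_hidden
```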
@gcucurull And is the code I am using correct? I had a doubt that maybe I am not using the network properly.
@monk1337 Yes, it looks good.
@gcucurull Thanks for the quick response. I have one more doubt: for the attention heads we are passing a list [8, 1].
I went through the code and got the idea that the second entry is for the output layer.
for i in range(n_heads[-1]):
    out.append(layers.attn_head(h_1, bias_mat=bias_mat,
                                out_sz=nb_classes, activation=lambda x: x,
                                in_drop=ffd_drop, coef_drop=attn_drop,
                                residual=False))
logits = tf.add_n(out) / n_heads[-1]
But what should the ratio between input heads and output heads be? How does it affect the output?
The output layer is the one computing the logits; if you use multiple heads, the final logits will be the average over the logits produced by each output head.
However, in our experiments we always used only 1 output head, that's why it is set to [8,1].
There isn't really a ratio between input heads and output heads, since the number of output heads should be 1. The number of input heads basically controls the number of parameters of the model and its expressive power, so you might want to increase or decrease it depending on your task.
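To make the output-head averaging concrete, here is a minimal NumPy sketch of what the `tf.add_n(out) / n_heads[-1]` line above computes (the shapes and head count here are hypothetical):

```python
import numpy as np

# Hypothetical setup: 3 output heads, each producing logits of
# shape (nb_nodes, nb_classes).
nb_nodes, nb_classes, n_out_heads = 4, 3, 3
rng = np.random.default_rng(0)
head_logits = [rng.standard_normal((nb_nodes, nb_classes))
               for _ in range(n_out_heads)]

# Same as tf.add_n(out) / n_heads[-1]: element-wise sum, then divide.
logits = np.add.reduce(head_logits) / n_out_heads

# Averaging heads is exactly the mean over the head axis.
assert np.allclose(logits, np.mean(head_logits, axis=0))
```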
@gcucurull Thanks for the response. Is the number of input heads the same as the number of classes? And what default number of heads should I use if I have a big graph?
The number of input heads and the number of classes are not related.
8 input heads worked well in our case, so I suggest starting with that value and tweaking it empirically. Increasing it will increase the capacity of the model; decreasing it will reduce the capacity, but also speed things up and lower memory consumption.
@gcucurull I tried experimenting with the number of heads and hidden units in the range 2 to 1024, but couldn't get accuracy near the GCN I showed above. GCN produces 90% accuracy, while GAT does not cross 85% after many combinations of hidden units and number of heads. I also tried to stack two GAT layers; let me know if this is correct:
logits_graph = GAT.inference(inputs=realtion_batch,
                             nb_classes=800,
                             nb_nodes=22,
                             training=True,
                             attn_drop=0.0,
                             ffd_drop=0.0,
                             bias_mat=adj_batch,
                             hid_units=[8],
                             n_heads=[8, 1],
                             residual=False,
                             activation=tf.nn.elu)
logits_graph_s = GAT.inference(inputs=logits_graph,
                               nb_classes=256,
                               nb_nodes=22,
                               training=True,
                               attn_drop=0.0,
                               ffd_drop=0.0,
                               bias_mat=adj_batch,
                               hid_units=[8],
                               n_heads=[8, 1],
                               residual=False,
                               activation=tf.nn.elu)
But when I tried these two layers, the accuracy was 0.0 for 100 epochs.
Why is GAT not performing better than GCN?
The code is not quite right.
First of all, if you want to have multiple GAT layers, you don't call GAT.inference twice; you increase the number of elements in the hid_units list. Also, why do you set nb_classes to 800? Do you really have 800 classes? You also seem to be working with very small graphs, with nb_nodes set to 22.
The correct way to have a GAT model with 2 layers, with 8 heads per layer and 128 units per head is the following:
logits_graph = GAT.inference(inputs=realtion_batch,
                             nb_classes=NUMBER_OF_OUTPUT_CLASSES,
                             nb_nodes=NUMBER_OF_NODES,
                             training=True,
                             attn_drop=0.0,
                             ffd_drop=0.0,
                             bias_mat=adj_batch,
                             hid_units=[128, 128],
                             n_heads=[8, 8, 1],
                             residual=False,
                             activation=tf.nn.elu)
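A small sketch of how those two lists map onto layers (illustrative only; the class count here is hypothetical): each entry of hid_units is a hidden layer, the matching entry of n_heads is its head count, and the extra final entry of n_heads is the output layer.

```python
# hid_units has one entry per hidden layer; n_heads has one extra
# entry at the end for the output layer.
hid_units = [128, 128]   # per-head hidden size of each hidden layer
n_heads = [8, 8, 1]      # heads per layer; last entry = output heads
nb_classes = 7           # hypothetical number of classes

# Hidden layers concatenate their heads, so each outputs heads * units.
hidden_widths = [h * u for h, u in zip(n_heads, hid_units)]
print(hidden_widths)     # [1024, 1024]

# The output layer averages its head(s) instead of concatenating,
# so the final output width is just nb_classes.
print(nb_classes)
```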
@gcucurull Should n_heads be [8, 8, 1] or [128, 128, 1]?
Sorry, you are right, it is [8, 8, 1], I edited the message to correct it.
Did this work?
Yup
I am trying to extract features from a graph attention network. I was using GCN as a feature extractor and I want to replace it with GAT.
Where the GraphConvolution layer is defined as:
Now, to replace the GCN layer with GAT, I tried this:
Now I want to get just the logits from GAT as features, and it should learn the features too, so I set training = True.
With GCN features I was getting around 90% accuracy, but with GAT features I am not able to get more than 80%; instead, it should increase the accuracy compared to GCN.
Is there anything I am missing in the network, or are my hyperparameters not set correctly compared to the ones I was using for GCN?
@PetarV- @gcucurull Can you suggest how I can extract features from GAT, and, if I am doing it the correct way, why I am not getting good accuracy?
Thank you