PetarV- / GAT

Graph Attention Networks (https://arxiv.org/abs/1710.10903)
https://petar-v.com/GAT/

Confusion about some codes #15

Closed: DexterZeng closed this issue 5 years ago

DexterZeng commented 5 years ago

Hi, thank you for sharing the code! I find the code below (in sp_attn_head, in utils/layers.py) a bit hard to understand, since it does not correspond directly to the description in the paper:

f_1 = tf.layers.conv1d(seq_fts, 1, 1)
f_2 = tf.layers.conv1d(seq_fts, 1, 1)

f_1 = tf.reshape(f_1, (nb_nodes, 1))
f_2 = tf.reshape(f_2, (nb_nodes, 1))

f_1 = adj_mat*f_1
f_2 = adj_mat * tf.transpose(f_2, [1,0])

logits = tf.sparse_add(f_1, f_2)

lrelu = tf.SparseTensor(indices=logits.indices,
                        values=tf.nn.leaky_relu(logits.values),
                        dense_shape=logits.dense_shape)

coefs = tf.sparse_softmax(lrelu)

I understand that seq_fts = tf.layers.conv1d(seq, out_sz, 1, use_bias=False) can be regarded as multiplying by a weight matrix W (converting F features to F’ features). However, I do not understand the following steps:

f_1 = tf.layers.conv1d(seq_fts, 1, 1)
f_2 = tf.layers.conv1d(seq_fts, 1, 1)
f_1 = adj_mat*f_1
f_2 = adj_mat * tf.transpose(f_2, [1,0])
logits = tf.sparse_add(f_1, f_2)

I am aware that the code is different because it operates at the “matrix” level rather than the node level, but I cannot see how the attention mechanism is achieved by these steps. Could you help explain a bit?

Many thanks, Weixin.

PetarV- commented 5 years ago

Hi Weixin,

Thanks for your interest in the GAT work and for reaching out. I'll try to elaborate on those lines in more detail.

As our attention mechanism is a single-layer NN, we can decompose it from W[h_i||h_j] into W_1h_i + W_2h_j (with W_1 and W_2 being separate linear transformations).

As W_1 and W_2 are applied pointwise to each node, we may express them as 1D convolutions with kernel size 1 across the node feature matrix. Applying these two convolutions gives us f_1 (which is W_1h_i) and f_2 (which is W_2h_j).
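
As a quick sanity check, here is a small NumPy sketch (not from the repository; the array names are made up) showing that a convolution with kernel size 1 over the node axis is the same as applying one linear map to every node:

import numpy as np

nodes, f_in, f_out = 5, 8, 1
rng = np.random.default_rng(0)
x = rng.normal(size=(nodes, f_in))        # node feature matrix (seq_fts-like)
kernel = rng.normal(size=(f_in, f_out))   # width-1 conv kernel

# The width-1 convolution visits one node at a time, so it reduces to
# multiplying each node's feature vector by the same matrix.
per_node = np.stack([x[i] @ kernel for i in range(nodes)])
assert np.allclose(per_node, x @ kernel)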

In the dense GAT layer (https://github.com/PetarV-/GAT/blob/master/utils/layers.py#L13-L17), we compute W_1h_i + W_2h_j for all (i, j) pairs, by simply adding f_1 to transpose(f_2). This implies adding something of shape (batch, nodes, 1) to something of shape (batch, 1, nodes), and through the power of broadcasting we obtain the full (unnormalised) attention coefficient matrix of shape (batch, nodes, nodes), with TensorFlow automatically performing this sum for all pairs of positions. You can then normalise the coefficients by applying LeakyReLU, bias and softmax.
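
To make the broadcasting concrete, here is a minimal NumPy sketch of the dense case (batch dimension dropped; h, w1 and w2 are hypothetical stand-ins for the projected node features and the two conv1d kernels):

import numpy as np

nodes, out_sz = 4, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(nodes, out_sz))   # seq_fts: W h_i for every node
w1 = rng.normal(size=(out_sz, 1))      # plays the role of the f_1 kernel
w2 = rng.normal(size=(out_sz, 1))      # plays the role of the f_2 kernel

f_1 = h @ w1                           # shape (nodes, 1): W_1 h_i
f_2 = h @ w2                           # shape (nodes, 1): W_2 h_j

# (nodes, 1) + (1, nodes) broadcasts to the full (nodes, nodes) matrix of
# unnormalised attention logits, e_ij = W_1 h_i + W_2 h_j.
logits = f_1 + f_2.T
print(logits.shape)                    # (4, 4)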

For the sparse layer, the idea is pretty much the same, only we do not wish to use more memory than is necessary. This is why we multiply the values with the sparse matrix's entries, to preserve only the positions i and j that are actually to be used.
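
For intuition, here is a NumPy sketch of the sparse idea (the edge arrays rows/cols are made up; they play the role of adj_mat's indices), where only one logit per edge is ever materialised:

import numpy as np

rows = np.array([0, 0, 1, 2, 3])       # source node of each edge
cols = np.array([1, 2, 2, 3, 0])       # target node of each edge
nodes = 4

rng = np.random.default_rng(0)
f_1 = rng.normal(size=(nodes, 1))      # W_1 h_i for every node
f_2 = rng.normal(size=(nodes, 1))      # W_2 h_j for every node

# Only |E| logits, never the dense (nodes, nodes) matrix.
edge_logits = f_1[rows, 0] + f_2[cols, 0]
edge_logits = np.where(edge_logits > 0, edge_logits, 0.2 * edge_logits)  # LeakyReLU

# Softmax over each node's neighbourhood (grouped by row).
edge_exp = np.exp(edge_logits)
row_sums = np.zeros(nodes)
np.add.at(row_sums, rows, edge_exp)
coefs = edge_exp / row_sums[rows]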

Does that help? Feel free to reach out if you need more info.

Thanks, Petar

DexterZeng commented 5 years ago

Hi Petar:

Thank you for your detailed explanation, which helps a lot!

Many thanks, Weixin.

KnightOfTheMoonlight commented 4 years ago

Hi Petar, thanks for your explanation! However, I have some questions about the bias matrix: how is it generated, what is the physical meaning of the bias, and why did you choose this construction? It looks like the default bias matrix generation for a dense adjacency matrix.
https://github.com/PetarV-/GAT/blob/ea3aeafb782e20738a090dbf7914aa6e5a1a137e/utils/process.py#L14
https://github.com/PetarV-/GAT/blob/ea3aeafb782e20738a090dbf7914aa6e5a1a137e/utils/process.py#L25

import numpy as np

def adj_to_bias(adj, sizes, nhood=1):
    """Turn a batch of dense adjacency matrices into additive attention biases."""
    nb_graphs = adj.shape[0]
    mt = np.empty(adj.shape)
    for g in range(nb_graphs):
        # Start from the identity and multiply by (A + I) nhood times, so that
        # mt[g][i][j] > 0 iff j is reachable from i within nhood hops
        # (self-loops included).
        mt[g] = np.eye(adj.shape[1])
        for _ in range(nhood):
            mt[g] = np.matmul(mt[g], (adj[g] + np.eye(adj.shape[1])))
        # Binarise the reachability counts.
        for i in range(sizes[g]):
            for j in range(sizes[g]):
                if mt[g][i][j] > 0.0:
                    mt[g][i][j] = 1.0
    # Reachable entries map to 0, everything else to -1e9, so adding this bias
    # to the attention logits masks out non-neighbours before the softmax.
    return -1e9 * (1.0 - mt)

songhuiming commented 4 years ago

Up to the line mt[g][i][j] = 1.0, it adds self-loops to the adjacency matrix (and expands it to the nhood-hop neighbourhood); the final line then maps non-neighbours to -1e9 so that they are suppressed by the softmax.
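
For reference, here is a small NumPy sketch (toy numbers, not from the repo) of how such a bias acts as a mask once it is added to the attention logits before the softmax, as the dense layer does with logits + bias_mat:

import numpy as np

# Toy logits for one node attending over a 3-node graph; the third node
# is not a neighbour, so its bias entry is -1e9.
logits = np.array([0.5, 1.0, 2.0])
bias = np.array([0.0, 0.0, -1e9])

masked = logits + bias
coefs = np.exp(masked) / np.exp(masked).sum()
print(coefs)   # the non-neighbour's coefficient is (numerically) zero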