Cartus / AGGCN

Attention Guided Graph Convolutional Networks for Relation Extraction (authors' PyTorch implementation for the ACL19 paper)
MIT License

About "M identical blocks" #23

Closed. ysn7 closed this issue 4 years ago.

ysn7 commented 4 years ago

Hello, I would like to ask how to understand the "M identical blocks" in the paper. What specifically does this mean? Thank you!

Cartus commented 4 years ago

M is a hyper-parameter; you can set M = 2, 3, 4, ... for different tasks.
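(For context, here is a minimal sketch of what "M identical blocks" means in code: the model simply stacks M copies of the same block architecture, each with its own parameters. The `make_block` factory below is a hypothetical placeholder, not the repo's actual module.)

```python
import torch.nn as nn

class StackedBlocks(nn.Module):
    """Sketch: M identical blocks stacked sequentially (make_block is hypothetical)."""
    def __init__(self, make_block, M):
        super().__init__()
        # M copies of the same block architecture, each with its own weights
        self.blocks = nn.ModuleList([make_block() for _ in range(M)])

    def forward(self, x, adj):
        for block in self.blocks:
            x = block(x, adj)   # each block refines the node representations
        return x
```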

ysn7 commented 4 years ago

OK, thank you. Could you tell me about the relationship between the 'n' of the matrix A (n×n) and the hyper-parameter 'N'? Are their values equal?

Cartus commented 4 years ago

The hyper-parameter N indicates the number of attention heads.

For example, if you use 3 heads (N=3), 3 attention matrices will be generated. Each matrix has size n × n, where n is the length of the sentence (the number of tokens).
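(To make the shapes concrete, here is a small sketch, independent of the repo's code and with made-up dimensions, showing that N heads yield N attention matrices of size n × n for a sentence of n tokens.)

```python
import torch
import torch.nn as nn

n, d = 10, 96          # sentence length (tokens), hidden dimension (illustrative values)
N = 3                  # number of attention heads
d_k = d // N           # per-head dimension

x = torch.randn(1, n, d)

# one linear projection for queries and one for keys (illustrative, not the repo's layers)
q_proj = nn.Linear(d, d)
k_proj = nn.Linear(d, d)

q = q_proj(x).view(1, n, N, d_k).transpose(1, 2)   # (1, N, n, d_k)
k = k_proj(x).view(1, n, N, d_k).transpose(1, 2)   # (1, N, n, d_k)

scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
attn = torch.softmax(scores, dim=-1)               # (1, N, n, n): N attention matrices, each n x n
print(attn.shape)                                  # torch.Size([1, 3, 10, 10])
```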

ysn7 commented 4 years ago

Thank you very much, I got it. Sorry, I have some other questions. First, why are sublayers used here in the GCN (sublayer_first=2, sublayer_second=4)? Second, how is the number of heads decided, and why is it 3? How does the model choose which nodes in the sentence are head nodes?

Cartus commented 4 years ago

For the first question, someone asked a similar one before; see #2.

For the second question, the number of heads is a hyper-parameter. It is not related to head nodes; rather, it is terminology from the multi-head attention mechanism. Please refer to the paper "Attention Is All You Need".
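(As an illustration that the number of heads is just a module hyper-parameter, here is a standalone example using PyTorch's built-in `nn.MultiheadAttention`, which is not the repo's implementation; no particular "head nodes" are selected from the sentence.)

```python
import torch
import torch.nn as nn

# num_heads is a hyper-parameter of the attention module; it does not select
# any particular "head nodes" from the sentence.
mha = nn.MultiheadAttention(embed_dim=96, num_heads=3, batch_first=True)

x = torch.randn(2, 10, 96)          # (batch, tokens, hidden dim)
out, attn_weights = mha(x, x, x)    # self-attention over all tokens
print(attn_weights.shape)           # torch.Size([2, 10, 10]) (weights averaged over the 3 heads)
```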

ysn7 commented 4 years ago

OK, thanks for your patience. In the code (aggcn.py), in the definition of `class MultiHeadAttention` there are only Query and Key; where is the definition of "Value"?

```python
def forward(self, query, key, mask=None):  # inputs: (q, k, mask)
    if mask is not None:
        mask = mask.unsqueeze(1)

    nbatches = query.size(0)

    query, key = [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
                  for l, x in zip(self.linears, (query, key))]
    # query = query.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    # key = key.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    attn = attention(query, key, mask=mask, dropout=self.dropout)

    return attn
```
Cartus commented 4 years ago

The reason is that we just need the attention matrix, which is treated as the adjacency matrix; GCN requires an adjacency matrix as input. The key motivation of our paper is to leverage the multi-head attention mechanism to learn the adjacency matrix rather than deriving it directly from the dependency tree.
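(A hedged sketch of this idea, not the repo's exact code: the softmax of the query-key scores is itself the soft adjacency matrix, so no value projection is needed; a GCN layer then aggregates node features with it.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_as_adjacency(query, key, mask=None):
    # Softmax over query-key scores; the result is used directly as a soft
    # adjacency matrix, so no value projection is needed.
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    return F.softmax(scores, dim=-1)            # (..., n, n)

class GCNLayerSketch(nn.Module):
    """A plain GCN layer that aggregates node features with the learned soft adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, adj, x):
        # adj: (batch, n, n) soft adjacency; x: (batch, n, dim) node features
        return F.relu(self.linear(torch.bmm(adj, x)))
```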

I suggest you go through our paper and the related references carefully. I won't be able to answer every detailed question here.

ysn7 commented 4 years ago

Thank you very much