Cartus / AGGCN

Attention Guided Graph Convolutional Networks for Relation Extraction (authors' PyTorch implementation for the ACL19 paper)
MIT License

Question about the adjacency matrix #2

Closed · speedcell4 closed this issue 5 years ago

speedcell4 commented 5 years ago

Hi~

In your abstract, you said "In this work, we propose Attention Guided Graph Convolutional Networks (AGGCNs), a novel model which directly takes full dependency trees as inputs", but I don't see how you make use of the dependency tree.

According to Equation (2), it seems you construct $\tilde{A}$ using only the input embeddings, so where do you make use of the original adjacency matrix $A$?

In your code, I still cannot see how you make use of $A$, which is adj in your code, right? https://github.com/Cartus/AGGCN_TACRED/blob/master/model/aggcn.py#L175

Could you please show me the details? And by the way, what is the matrix $V$ in your Equation (2)? I cannot find the definition of $V$ either.

speedcell4 commented 5 years ago

Yes, I know you build a mask from adj, but isn't this equivalent to src_mask above?

Cartus commented 5 years ago

Hi,

For the first question, there is a description in the paper:

In practice, we treat the original adjacency matrix as an initialization so that the dependency information can be captured in the node representations for later attention calculation. The attention guided layer is included starting from the second block.

If you look at the code, I created two types of graph convolutional layers starting from line 126:

# gcn layer
for i in range(self.num_layers):
    if i == 0:
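        # first block: plain GCN sub-layers that consume the dependency-tree adjacency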
        self.layers.append(GraphConvLayer(opt, self.mem_dim, self.sublayer_first))
        self.layers.append(GraphConvLayer(opt, self.mem_dim, self.sublayer_second))
    else:
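        # later blocks: multi-head, attention guided GCN sub-layers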
        self.layers.append(MultiGraphConvLayer(opt, self.mem_dim, self.sublayer_first, self.heads))
        self.layers.append(MultiGraphConvLayer(opt, self.mem_dim, self.sublayer_second, self.heads))

For the first block, we use the original adjacency matrix from the dependency tree. From the second block onward, we use adjacency matrices calculated from the representations (we assume they have already captured the dependency relations, since they are obtained from the first block). You can refer to the code from line 170:

for i in range(len(self.layers)):
    if i < 2:
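        # first block: feed the original dependency-tree adjacency matrix (adj)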
        outputs = self.layers[i](adj, outputs)
        layer_list.append(outputs)
    else:
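        # later blocks: rebuild one soft adjacency matrix per attention head from the current representations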
        attn_tensor = self.attn(outputs, outputs, src_mask)
        attn_adj_list = [attn_adj.squeeze(1) for attn_adj in torch.split(attn_tensor, 1, dim=1)]
        outputs = self.layers[i](attn_adj_list, outputs)
        layer_list.append(outputs)

When i < 2, the adj represents the original dependency tree.
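For concreteness, here is a minimal sketch of what the attention calculation in the later blocks produces, assuming a simplified multi-head attention that only returns the softmax scores (the class name, shapes, and mask convention below are illustrative, not the repo's exact MultiHeadAttention):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAdjacency(nn.Module):
    """Toy sketch: build one soft adjacency matrix per head from the node
    representations, i.e. Equation (2) without any value projection."""
    def __init__(self, dim, heads):
        super().__init__()
        self.dim = dim
        self.heads = heads
        self.q = nn.Linear(dim, dim * heads)
        self.k = nn.Linear(dim, dim * heads)

    def forward(self, x, src_mask):
        # x: (batch, n, dim); src_mask: (batch, 1, n), 0 at padding positions
        b, n, _ = x.size()
        q = self.q(x).view(b, n, self.heads, self.dim).transpose(1, 2)
        k = self.k(x).view(b, n, self.heads, self.dim).transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.dim ** 0.5
        scores = scores.masked_fill(src_mask.unsqueeze(1) == 0, -1e9)
        return F.softmax(scores, dim=-1)  # (batch, heads, n, n)

Each of the resulting matrices then plays the role of adj for one head inside the MultiGraphConvLayer blocks.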

Regarding $V$ in Equation (2): that is a typo. Thank you so much for pointing it out! We do not need the value matrix here, since we only use the query and key to calculate the correlation scores.
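For reference, my reading of the corrected Equation (2), with $V$ dropped and the rest of the paper's notation unchanged:

$$\tilde{A}^{(t)} = \mathrm{softmax}\left(\frac{Q W_i^{Q} \times \left(K W_i^{K}\right)^{\top}}{\sqrt{d}}\right)$$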

Cartus commented 5 years ago

I will close this issue. If you have any further questions, feel free to reopen it.

speedcell4 commented 5 years ago

But why do you need two sets of sub-layers? I mean sublayer_first and sublayer_second?

speedcell4 commented 5 years ago

And there is no reopen button on this page.

Cartus commented 5 years ago

For the sub-layer question, you can refer to this TACL paper, DCGCN. Basically, the motivation is to imitate convolution filters of different sizes (1x1, 3x3, etc.) in CNNs.

The number of sub-layers in each block differs for TACRED: here the first sub-layer count is 2 and the second is 4. You can refer to train.py; these are hyper-parameters.
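To illustrate the sub-layer idea, here is a toy sketch in the spirit of DCGCN's densely connected sub-layers (names and dimensions are made up; this is not the repo's GraphConvLayer):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGCNBlock(nn.Module):
    """Toy sketch of a densely connected GCN block: sub-layer i sees the
    block input plus the outputs of all earlier sub-layers, so a block with
    k sub-layers aggregates information from up to k hops, loosely mimicking
    convolution filters of different sizes."""
    def __init__(self, dim, num_sublayers):
        super().__init__()
        self.sub_dim = dim // num_sublayers
        self.linears = nn.ModuleList([
            nn.Linear(dim + i * self.sub_dim, self.sub_dim)
            for i in range(num_sublayers)])
        self.out = nn.Linear(dim + num_sublayers * self.sub_dim, dim)

    def forward(self, adj, x):
        # adj: (batch, n, n) hard or soft adjacency; x: (batch, n, dim)
        cache = [x]
        for linear in self.linears:
            inp = torch.cat(cache, dim=-1)
            cache.append(F.relu(torch.matmul(adj, linear(inp))))  # one-hop conv
        return self.out(torch.cat(cache, dim=-1))

Under this reading, a group with 2 sub-layers and a group with 4 sub-layers act like two different "filter sizes", which is what sublayer_first and sublayer_second control.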

speedcell4 commented 5 years ago

Thank you~

goPikachu88 commented 5 years ago

Hello, I still have some questions about the number of layers.

In your AGGCN paper, Section 3.2, you mentioned that the best setting for sentence-level relation extraction is M=2, L=5.

> For the sub-layer question, you can refer to this TACL paper, DCGCN. Basically, the motivation is to imitate convolution filters of different sizes (1x1, 3x3, etc.) in CNNs.
>
> The number of sub-layers in each block differs for TACRED: here the first sub-layer count is 2 and the second is 4. You can refer to train.py; these are hyper-parameters.

Cartus commented 5 years ago

Hi @ardellelee,

For the questions you mentioned above:

  • Is L the sum of sublayer_first and sublayer_second?

Yes, you are correct.

  • As I understand, M is the number of AGGCN blocks, and should be identical to the argument --num_layers in train.py. However, the argument is described as "Num of RNN layers" in the code, which is a bit confusing. Is it a typo?

Yes, it is a typo... Thank you so much for pointing it out! It should be the number of blocks, which is M.

vhientran commented 5 years ago

Hello @Cartus, you mentioned that the best setting for sentence-level relation extraction is M=2, L=5. L is the sum of sublayer_first and sublayer_second, with the first sub-layer being 2 and the second 4. But 2 + 4 = 6, not 5?

vhientran commented 5 years ago

Hi @speedcell4, could you explain this to me? :)

speedcell4 commented 5 years ago

@TranVanHien Sorry, I cannot. I realized I could never figure out these complicated experiment settings, so I gave up on the relation extraction task two months ago.

vhientran commented 5 years ago

Thank you.

Cartus commented 5 years ago

Hi @TranVanHien, that is a typo in the paper; I will fix it later. The code in this repo uses the default setting for the TACRED dataset.

Thank you for pointing it out!
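To make the counting explicit with the numbers quoted in this thread: $L = \text{sublayer\_first} + \text{sublayer\_second} = 2 + 4 = 6$ for the released TACRED configuration, and $M$ is --num_layers, so the "L=5" figure in the paper does not match the TACRED default, which is the typo acknowledged above.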

vhientran commented 5 years ago

I see. Thank you for your quick reply.

liuyijiang1994 commented 5 years ago

Hi Cartus @Cartus, I read your answers above about the use of the dependency tree. You take it as an initialization, so it only comes into play in the first GCN block, right? If so, from the second GCN block onward the adjacency matrix becomes the weights produced by self-attention, and the structure becomes a fully connected directed graph. Is there any difference between plain self-attention and the attention guided GCN on a fully connected directed graph? Thanks for your help!

Cartus commented 5 years ago

Hi @liuyijiang1994 ,

Thanks for the question. Actually, they are basically the same. I read a paper a few weeks ago that discusses GNNs and the Transformer model, which might be useful. It also has a nice summary of related work, including different GNNs and different self-attention methods.

Contextualized Non-local Neural Networks for Sequence Learning https://arxiv.org/abs/1811.08600#
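A tiny numerical illustration of the point (hypothetical toy code, not from the repo): once the adjacency matrix is itself the attention matrix over a fully connected graph, a graph convolution step is the same weighted aggregation as self-attention, up to the nonlinearity and the surrounding multi-head/dense-connection machinery.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 4, 8)                      # (batch, nodes, dim)
W = torch.randn(8, 8)                         # shared projection matrix

# soft, fully connected "adjacency" from scaled dot-product attention scores
A_attn = F.softmax(x @ x.transpose(-2, -1) / 8 ** 0.5, dim=-1)

attn_out = A_attn @ (x @ W)                   # self-attention aggregation (values = xW)
gcn_out = F.relu(A_attn @ (x @ W))            # GCN step over the soft graph

print(torch.allclose(F.relu(attn_out), gcn_out))  # True: same aggregation plus ReLU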

liuyijiang1994 commented 5 years ago

@Cartus Thank you for your prompt reply, it is very helpful to me!