Open gabrer opened 5 years ago
From what I understand, `self.query_embedding` is initialized as an `nn.Embedding(number_of_context_layers_you_want, dim_of_LSTM)` layer, which has gradients and is therefore trainable.
When you do `sent_w = self.query_embedding(torch.LongTensor(bsize*[0]).cuda()).unsqueeze(2)`, you are not randomly initializing it on each pass. The `[0]` selects the first row of the embedding table you declared, so you are actually retrieving the learned parameters of the query embedding. It's effectively `embedding.weight[0]`.
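A minimal sketch of this point (dimensions are illustrative, not the repo's actual values): looking up index 0 in an `nn.Embedding` returns the same learned row on every forward pass, and that row receives gradients like any other parameter:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm_dim = 4  # stand-in for dim_of_LSTM
query_embedding = nn.Embedding(1, lstm_dim)  # one learnable "context" row

bsize = 3
idx = torch.zeros(bsize, dtype=torch.long)  # equivalent of bsize*[0]

# Two forward passes return the SAME values: the lookup is a parameter
# read, not a fresh random initialization.
w1 = query_embedding(idx)
w2 = query_embedding(idx)
assert torch.equal(w1, w2)

# The looked-up row participates in the autograd graph, so it is trained.
w1.sum().backward()
assert query_embedding.weight.grad is not None
```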
I was having a look at the implementation of `InnerAttentionNAACLEncoder`, which should be the sentence encoder from "Hierarchical Attention Networks for Document Classification" by Yang et al. 2016.
However, I would raise the following issues:
- On line 538, it sums up the product of the attention weights `alphas` with the linear projection of the hidden state of each word. However, in the original paper, the weights are multiplied with the original hidden states representing the words, so it should be `sent_output` rather than `sent_output_proj`.
- On line 529, it computes the dot product of the projected hidden state `sent_key_proj` with the so-called context vector `sent_w` (i.e. `u_it` and `u_w`, respectively, in the paper). However, it looks like `sent_w` is instantiated at each iteration with `Variable(torch.LongTensor(bsize*[0]).cuda())`, the input to an embedding layer. I am wondering whether this vector should instead be a model parameter learned during training, as stated in the paper. This part of the code is not very clear to me.
- It uses a BiLSTM, while the paper states they used a BiGRU.
- It extracts `self.pool_type` without using it. Might that be a typo?
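For comparison, here is a minimal sketch of the word-attention step as the paper describes it (`u_it = tanh(W h_it + b)`, `alpha_it = softmax(u_it · u_w)`, `s_i = sum alpha_it * h_it`), with the context vector `u_w` registered as an explicit learned parameter and the weighted sum taken over the original hidden states. Names and dimensions are illustrative, not the repo's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Attention from Yang et al. 2016 over BiGRU outputs (sketch)."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        # Context vector u_w as an explicit model parameter, learned jointly.
        self.u_w = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, h):
        # h: (batch, seq_len, hidden_dim) -- word-level hidden states
        u = torch.tanh(self.proj(h))         # u_it: (batch, seq, hid)
        scores = u @ self.u_w                # u_it . u_w: (batch, seq)
        alphas = F.softmax(scores, dim=1)    # attention over words
        # Weighted sum over the ORIGINAL hidden states h,
        # not over the projection u.
        return (alphas.unsqueeze(2) * h).sum(dim=1)  # (batch, hid)
```

This makes the two questions above concrete: the final sum uses `h` (the analogue of `sent_output`), and `u_w` is a plain `nn.Parameter` rather than an embedding lookup reconstructed each pass.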