Closed: feiyulv closed this issue 6 years ago
Hi @feiyulv, there's no special handling of padding tokens for the attention mechanism.
@jgehring Thanks, but is that OK? Shouldn't attention focus only on non-padding words?
For training, the data is sorted by sentence length so the amount of padding should be small. For validation, testing and generation we limit mini-batches to source sentences with equal length to avoid any influence of padding.
But yeah, in theory the models should learn to ignore the padding.
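To illustrate the length-sorting idea described above, here is a minimal, hypothetical sketch (not the actual fairseq batching code) of grouping sentences by length so that each mini-batch needs little or no padding:

```python
# Hypothetical sketch of length-based bucketing to minimize padding.
# Not the actual fairseq implementation; names are illustrative.
def make_batches(sentences, batch_size):
    """Group sentences of similar length so padding per batch stays small."""
    # Sort by length (stable sort), then slice into consecutive batches.
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

data = [["a"], ["b", "c", "d"], ["e", "f"], ["g"]]
batches = make_batches(data, batch_size=2)
# Each batch now contains sentences of equal or near-equal length,
# so little padding is needed when the batch is turned into a tensor.
```

With equal-length sentences per batch (as in validation/generation above), padding disappears entirely and attention cannot be influenced by it.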
Got it. I'm re-implementing fairseq in MXNet, but the results are very bad. Maybe I'm missing some important tricks.
@feiyulv how is your re-implementation progressing? Do you have any other questions related to attention and padding? Otherwise I'll close this thread.
@jgehring The results are still not very good. A GRU-based NMT model can achieve a BLEU score of 41 on our Chinese-English task, but my implementation only gets 28. I use the Adadelta optimizer with gradient clipping at 1.0, and it converges very quickly to a local optimum. I also removed weight normalization from the network since it is not convenient to implement in MXNet. Any suggestions?
We didn't run many experiments with Adadelta, as some preliminary tests were not very promising. In our experience, WeightNorm is quite helpful for fast convergence (you can check out the ablations in "Language Modeling with Gated Convolutional Networks").
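For reference, weight normalization (Salimans & Kingma, 2016) just reparameterizes each weight vector w as g * v / ||v||, decoupling the norm g from the direction v. A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

# Minimal sketch of weight normalization: w = g * v / ||v||.
# The scalar g controls the norm of w; v only controls its direction.
def weight_norm(v, g):
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])   # ||v|| = 5
w = weight_norm(v, g=2.0)
# ||w|| equals g regardless of the scale of v.
```

In a real framework both g and v are learned parameters; frameworks like MXNet/Gluon or PyTorch let you apply this reparameterization to existing layers.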
I don't know exactly what you implemented, but in my experience, attending over padding degraded results badly. One workaround is to avoid batching and process sentences one by one, so there is no padding at all. Alternatively, the dataset (including online inputs) should be sorted by length so that sentences in a batch have the same or similar lengths. Otherwise, the attention ultimately has to be modified to ignore padding.
@jgehring Thanks for your advice. @neuraltalk I use a mask to handle the padding when training with batches. The mask is a 2-D array of shape (batch_size, seq_len); pseudo-code looks like this:
energy = energy * mask + (mask-1.0) * 100000
att = softmax_activation(energy)
For padding tokens the mask value is 0, otherwise 1.
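The masking scheme above can be checked numerically. This is a small NumPy sketch of the same idea (a large negative constant added to padded positions before the softmax), with illustrative names:

```python
import numpy as np

# Sketch of masked attention: padded positions get energy - 100000,
# so softmax drives their attention weights to effectively zero.
def masked_softmax(energy, mask):
    energy = energy * mask + (mask - 1.0) * 100000.0
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(energy - energy.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

energy = np.array([[1.0, 2.0, 3.0]])   # (batch_size=1, seq_len=3)
mask = np.array([[1.0, 1.0, 0.0]])     # last token is padding
att = masked_softmax(energy, mask)
# att[0, 2] is effectively zero; the weights over real tokens sum to ~1.
```

This matches the behavior intended by the pseudo-code: padded tokens receive (essentially) no attention mass.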
Does fairseq handle the padding tokens when calculating the attention weights? I have read the source code and can't find the code doing this job. @jgehring