Closed: feiyulv closed this issue 6 years ago
Hi @feiyulv, there's no special handling of padding tokens for the attention mechanism.
@jgehring Thanks, but is that OK? Shouldn't attention focus only on non-padding words?
For training, the data is sorted by sentence length so the amount of padding should be small. For validation, testing and generation we limit mini-batches to source sentences with equal length to avoid any influence of padding.
But yeah, in theory the models should learn to ignore the padding.
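To illustrate the length-sorting idea described above, here is a minimal, hypothetical sketch (not the actual fairseq batching code) of grouping sentences by length so that each mini-batch needs little or no padding:

```python
# Hypothetical sketch of length-based bucketing to minimize padding.
# Not the actual fairseq implementation; names are illustrative.
def make_batches(sentences, batch_size):
    """Group sentences of similar length so padding per batch stays small."""
    # Sort by length (stable sort), then slice into consecutive batches.
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

data = [["a"], ["b", "c", "d"], ["e", "f"], ["g"]]
batches = make_batches(data, batch_size=2)
# Each batch now contains sentences of equal or near-equal length,
# so little padding is needed when the batch is turned into a tensor.
```

With equal-length sentences per batch (as in validation/generation above), padding disappears entirely and attention cannot be influenced by it.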
Got it. I'm re-implementing fairseq in MXNet, but the results are very bad. Maybe I'm missing some important tricks.
@feiyulv how is your re-implementation progressing? Do you have any other questions related to attention and padding? Otherwise I'll close this thread.
@jgehring The results are still not very good. A GRU-based NMT model can achieve a BLEU score of 41 on our Chinese-English task, but my implementation only gets 28. I use the Adadelta optimizer with gradient clipping at 1.0, and it converges very quickly to a local optimum. I also removed weight normalization from the network since it is not convenient to implement in MXNet. Any suggestions?
We didn't run many experiments with Adadelta, as some preliminary tests were not very promising. In our experience, WeightNorm is quite helpful for fast convergence (you can check out the ablations in "Language Modeling with Gated Convolutional Networks").
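For reference, weight normalization (Salimans & Kingma, 2016) just reparameterizes each weight vector w as g * v / ||v||, decoupling the norm g from the direction v. A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

# Minimal sketch of weight normalization: w = g * v / ||v||.
# The scalar g controls the norm of w; v only controls its direction.
def weight_norm(v, g):
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])   # ||v|| = 5
w = weight_norm(v, g=2.0)
# ||w|| equals g regardless of the scale of v.
```

In a real framework both g and v are learned parameters; frameworks like MXNet/Gluon or PyTorch let you apply this reparameterization to existing layers.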
I don't know exactly what you implemented, but in my experience, attending over padding degraded results badly. One workaround is to avoid batching and process sentences one by one, so there is no padding at all. Alternatively, the dataset (including online inputs) should be sorted by length so that sentences in a batch have the same or similar lengths. Otherwise, the attention ultimately has to be modified to ignore padding.
@jgehring Thanks for your advice. @neuraltalk I use a mask to handle the padding when training with batches. The mask is a 2-D array of shape (batch_size, seq_len); pseudo-code looks like this:
energy = energy * mask + (mask-1.0) * 100000
att = softmax_activation(energy)
For padding tokens the mask value is 0, otherwise 1.
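The masking scheme above can be checked numerically. This is a small NumPy sketch of the same idea (a large negative constant added to padded positions before the softmax), with illustrative names:

```python
import numpy as np

# Sketch of masked attention: padded positions get energy - 100000,
# so softmax drives their attention weights to effectively zero.
def masked_softmax(energy, mask):
    energy = energy * mask + (mask - 1.0) * 100000.0
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(energy - energy.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

energy = np.array([[1.0, 2.0, 3.0]])   # (batch_size=1, seq_len=3)
mask = np.array([[1.0, 1.0, 0.0]])     # last token is padding
att = masked_softmax(energy, mask)
# att[0, 2] is effectively zero; the weights over real tokens sum to ~1.
```

This matches the behavior intended by the pseudo-code: padded tokens receive (essentially) no attention mass.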
Does fairseq handle the padding tokens when calculating the attention weights? I have read the source code and can't find the code doing this job. @jgehring