Why is the last element being dropped from the mask? I believe the mask has a 1 for each non-padded value in the sequence, so if the sequence is (taking a 1D example) [1, 2, 3, PADDING], the mask will be [1, 1, 1, 0]. This would make inputs [1, 2, 3] and targets [2, 3, PADDING]. Since the last element of the mask is dropped, the mask becomes [1, 1, 1], and so the prediction going from 3 -> PADDING gets counted toward the loss. Is this intentional for transformers, or am I missing something?
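To make the alignment concern concrete, here is a minimal sketch (not the repo's actual code; `PAD`, `seq`, and the random logits are placeholders) showing the two ways the mask can be shifted and why dropping the *last* element counts the 3 -> PADDING prediction, while dropping the *first* element (equivalently, masking on the targets) excludes it:

```python
import torch
import torch.nn.functional as F

PAD = 0
seq = torch.tensor([1, 2, 3, PAD])            # one padded sequence
mask = (seq != PAD).float()                   # [1., 1., 1., 0.]

inputs = seq[:-1]                             # [1, 2, 3]
targets = seq[1:]                             # [2, 3, PAD]

vocab_size = 4
logits = torch.randn(len(inputs), vocab_size)  # stand-in for model outputs

per_token = F.cross_entropy(logits, targets, reduction="none")

mask_drop_last = mask[:-1]                    # [1., 1., 1.] -> counts 3 -> PAD
mask_drop_first = mask[1:]                    # [1., 1., 0.] -> ignores 3 -> PAD

loss_counting_pad = (per_token * mask_drop_last).sum() / mask_drop_last.sum()
loss_ignoring_pad = (per_token * mask_drop_first).sum() / mask_drop_first.sum()
print(loss_counting_pad, loss_ignoring_pad)
```

Under this reading, using `mask[:-1]` is what lets the transition into the first padding token contribute to the loss, which is the behavior being asked about.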