In the `forward()` for the `MultiHeadAttention` class in `assignment3/cs231n/transformer_layers.py`, the argument list provided by the setup code says:

`attn_mask: Array of shape (T, S) where mask[i,j] == 0`

but it should say:

`attn_mask: Array of shape (S, T) where mask[i,j] == 0`

If `attn_mask` really had shape `(T, S)`, it would need to be transposed, because the product of the query and key matrices has shape `(batch_size, num_heads, S, T)`, so the masking code would have to be

`query_key_product.masked_fill(torch.transpose(attn_mask, 0, 1) == 0, -np.inf)`

which does not reproduce the value given by `expected_masked_self_attn_output`. The output only matches the provided output if you don't transpose `attn_mask`, which contradicts the docstring.
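To make the shape argument concrete, here is a minimal sketch (not the assignment's solution; all sizes are made up for illustration) showing that a mask of shape `(S, T)` lines up with attention scores of shape `(N, H, S, T)` with no transpose needed:

```python
import torch

# Illustrative sizes: batch, heads, query length S, key length T, head dim.
N, H, S, T, D = 2, 3, 4, 5, 8

query = torch.randn(N, H, S, D)   # one row per output (query) position
key = torch.randn(N, H, T, D)     # one row per input (key) position

# (N, H, S, D) @ (N, H, D, T) -> (N, H, S, T): entry [..., i, j] scores
# query position i against key position j.
scores = query @ key.transpose(-2, -1)

# mask[i, j] == 0 means "query i may not attend to key j". Its (S, T)
# shape broadcasts directly over the leading (N, H) dims of `scores`,
# so masked_fill works without torch.transpose(attn_mask, 0, 1).
mask = torch.tril(torch.ones(S, T))
masked = scores.masked_fill(mask == 0, float('-inf'))
```

If the mask truly were `(T, S)`, the `masked_fill` above would raise a broadcasting error whenever `S != T`, which is another quick way to check which convention the setup code actually uses.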