In the `forward()` for the `MultiHeadAttention` class in `assignment3/cs231n/transformer_layers.py`, the argument list provided by the setup code says:

`attn_mask: Array of shape (T, S) where mask[i,j] == 0`

but it should say:

`attn_mask: Array of shape (S, T) where mask[i,j] == 0`

If `attn_mask` really had shape `(T, S)`, it would need to be transposed, because the product of the query and key matrices has shape `(batch_size, num_heads, S, T)`, so the masking code would have to be

`query_key_product.masked_fill(torch.transpose(attn_mask, 0, 1) == 0, -np.inf)`

which does not reproduce the value given by `expected_masked_self_attn_output`. The output only matches the provided output if you don't transpose `attn_mask`, which contradicts the docstring.
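To make the shape argument concrete, here is a minimal sketch (not the assignment's solution; all sizes are made up for illustration) showing that a mask of shape `(S, T)` lines up with attention scores of shape `(N, H, S, T)` with no transpose needed:

```python
import torch

# Illustrative sizes: batch, heads, query length S, key length T, head dim.
N, H, S, T, D = 2, 3, 4, 5, 8

query = torch.randn(N, H, S, D)   # one row per output (query) position
key = torch.randn(N, H, T, D)     # one row per input (key) position

# (N, H, S, D) @ (N, H, D, T) -> (N, H, S, T): entry [..., i, j] scores
# query position i against key position j.
scores = query @ key.transpose(-2, -1)

# mask[i, j] == 0 means "query i may not attend to key j". Its (S, T)
# shape broadcasts directly over the leading (N, H) dims of `scores`,
# so masked_fill works without torch.transpose(attn_mask, 0, 1).
mask = torch.tril(torch.ones(S, T))
masked = scores.masked_fill(mask == 0, float('-inf'))
```

If the mask truly were `(T, S)`, the `masked_fill` above would raise a broadcasting error whenever `S != T`, which is another quick way to check which convention the setup code actually uses.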