I found a discrepancy between the paper and the code. Equation 2 in your paper adds the mask to QK, but your code uses the function 'masked_fill', which looks to me like it multiplies the mask with QK. Could you please explain?
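For anyone comparing the two forms, here is a minimal sketch of why an additive mask and a `masked_fill`-style overwrite can give identical attention weights. It assumes the mask in Equation 2 is 0 for kept positions and -inf for blocked ones, and that the code fills blocked positions with -inf; neither is confirmed in this thread. It is written in plain Python so it runs without PyTorch, and all names (`additive_mask`, `fill_mask`, `multiply_mask`, the sample scores) are hypothetical.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Numerically stable softmax; assumes at least one finite entry."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def additive_mask(scores, blocked):
    """Additive form (assumed): add M, where M is 0 (keep) or -inf (block)."""
    return [s + (NEG_INF if b else 0.0) for s, b in zip(scores, blocked)]

def fill_mask(scores, blocked):
    """masked_fill-style form (assumed fill value -inf): overwrite blocked entries."""
    return [NEG_INF if b else s for s, b in zip(scores, blocked)]

def multiply_mask(scores, blocked):
    """True multiplication by a 0/1 mask, shown only for contrast."""
    return [s * (0.0 if b else 1.0) for s, b in zip(scores, blocked)]

scores = [2.0, -1.0, 0.5, 3.0]          # one hypothetical row of QK^T
blocked = [False, True, False, True]    # True marks a masked position

added = softmax(additive_mask(scores, blocked))
filled = softmax(fill_mask(scores, blocked))
multiplied = softmax(multiply_mask(scores, blocked))

print(added == filled)    # the additive and fill forms agree
print(multiplied[1] > 0)  # multiplication still leaks weight to a blocked position
```

Under those assumptions the overwrite is the additive form in disguise, because s + (-inf) is -inf and s + 0 is s. A real multiplication by a 0/1 mask behaves differently: a blocked score becomes 0, and exp(0) still contributes weight after the softmax.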
I have the same question. A row of the mask could be all -inf because every value was below the threshold. After softmax, such a row becomes NaN, which breaks backpropagation. How should the mask be applied in this case?
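To make the fully masked case concrete, here is a small sketch that reproduces the NaN and shows one possible guard: return zero weights for a row in which every position is blocked. This is a common workaround, not necessarily what the authors do. It is plain Python for portability, and the function names and sample values are hypothetical.

```python
import math

def naive_masked_softmax(scores, blocked):
    """Fill blocked entries with -inf, then softmax; NaN if the whole row is blocked."""
    filled = [float("-inf") if b else s for s, b in zip(scores, blocked)]
    peak = max(filled)
    exps = [math.exp(v - peak) for v in filled]  # -inf - (-inf) is NaN
    total = sum(exps)
    return [e / total for e in exps]

def safe_masked_softmax(scores, blocked):
    """Softmax over unblocked entries; a fully blocked row yields zeros, not NaN."""
    kept = [s for s, b in zip(scores, blocked) if not b]
    if not kept:
        # Every position is masked: there is nothing to attend to.
        return [0.0 for _ in scores]
    peak = max(kept)
    exps = [0.0 if b else math.exp(s - peak) for s, b in zip(scores, blocked)]
    total = sum(exps)
    return [e / total for e in exps]

row = [0.3, -1.2, 0.8]                  # hypothetical attention scores
all_blocked = [True, True, True]
some_blocked = [False, True, False]

naive = naive_masked_softmax(row, all_blocked)
safe = safe_masked_softmax(row, all_blocked)
partial = safe_masked_softmax(row, some_blocked)

print(all(math.isnan(v) for v in naive))  # the naive version produces NaN everywhere
print(safe)                               # the guarded version returns zeros
```

The same guard can be expressed on tensors by detecting rows where the mask covers every position and handling them separately. Another option is to fill with a large finite negative value instead of -inf, which turns a fully masked row into a uniform distribution rather than NaN; which behavior is appropriate depends on the model.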