BraveDrXuTF opened 1 month ago
Hi, I'm not sure which type of padding you are referring to. We pad sequences of different lengths to the same length with zero padding, so this padding does not influence the attention computation. And we do not need causal masking for attention computation on spatial data.
Hi, @HaoZhongkai Thank you for your response. The padding I mean is exactly that padding of sequences of different lengths to the same length.
After the K, V mapping, because the bias of the K and V linear layers is not set to zero, the values at the padded positions of the tensors after the K, V linear transform will not be zero either.
Of course, being non-zero might not be a big problem...
But I would still suggest adding a mask for Q, K, and V so that the attention matrix is not affected by these unreal padded positions.
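To illustrate what I mean, here is a minimal standalone sketch (not GNOT's actual layers; the layer size and variable names are made up). A `nn.Linear` with a bias maps all-zero padded positions to non-zero values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
proj = nn.Linear(8, 8)      # bias=True by default, like a typical K/V projection
x = torch.zeros(1, 4, 8)    # pretend these 4 positions are all zero padding
print(proj(x)[0, 0])        # non-zero: the bias leaks into the padded positions
```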
In my experiments, as you can see below, even in the first attention block of GNOT, the values of k at the last few (padded) positions are already non-zero and slightly different from one another. This might be caused by numerical error or something similar:
```
(Pdb) k[:,10,-2:,10:14]
tensor([[[ 3.0466e-04, -3.8661e-04, -1.5450e-04, 5.3415e-06],
         [ 3.0456e-04, -3.8309e-04, -1.5212e-04, 3.2105e-06]],
        [[ 3.0484e-04, -3.8515e-04, -1.5213e-04, 7.4536e-06],
         [ 3.0420e-04, -3.8171e-04, -1.5517e-04, 6.6272e-06]],
        [[ 2.9967e-04, -3.8171e-04, -1.4730e-04, 6.0417e-06],
         [ 2.9956e-04, -3.8265e-04, -1.4569e-04, 6.8420e-06]],
        [[ 3.0649e-04, -3.8577e-04, -1.5380e-04, 5.6187e-06],
         [ 3.0878e-04, -3.8615e-04, -1.5406e-04, 5.9028e-06]],
        [[ 2.9486e-04, -3.8043e-04, -1.4438e-04, 9.2388e-06],
         [ 2.9771e-04, -3.8179e-04, -1.4543e-04, 7.7717e-06]],
        [[ 3.0080e-04, -3.8102e-04, -1.4764e-04, 9.0702e-06],
         [ 2.9898e-04, -3.8154e-04, -1.4521e-04, 9.6725e-06]],
        [[ 2.9786e-04, -3.7959e-04, -1.4511e-04, 9.8170e-06],
         [ 2.9833e-04, -3.7960e-04, -1.4366e-04, 7.0692e-06]],
        [[ 3.0522e-04, -3.8424e-04, -1.5075e-04, 4.8854e-06],
         [ 3.0750e-04, -3.8452e-04, -1.5250e-04, 7.1328e-06]]],
```
In NLP, we have a masking mechanism to prevent this. But in GNOT, the code in https://github.com/HaoZhongkai/GNOT/blob/master/models/cgpt.py does not seem to apply any masking procedure, so it looks like the elements of the attention matrix are contaminated by the padded part. Is that true? Thanks.
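For reference, this is a minimal sketch of what I mean by a key-padding mask, written against standard softmax attention rather than GNOT's actual attention code; the function name and shapes are only for illustration:

```python
import torch

def attention_with_key_padding_mask(q, k, v, pad_mask):
    # q, k, v: (batch, heads, seq, dim); pad_mask: (batch, seq), True at padded positions
    scores = (q @ k.transpose(-2, -1)) / k.size(-1) ** 0.5
    # push scores toward padded keys to -inf so softmax assigns them zero weight
    scores = scores.masked_fill(pad_mask[:, None, None, :], float('-inf'))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v
```

For a linear-attention style block, I assume the analogous fix would be to zero out the padded rows of k and v (and of q for the output positions) before the aggregation, so the padded positions contribute nothing to the sums.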