wangwang110 opened this issue 2 years ago
For the replace case, when we calculate the attention scores at position i, we don't let it consider its own token w(i).
At the first layer I think this is fine, but at the second and higher layers we still use the information of w(i) indirectly, because other positions attended to w(i) at the layer below.
Is that OK?
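To make the concern concrete, here is a minimal, hypothetical sketch (not this repo's actual code) of a per-layer attention mask that only blocks position i from attending to its own replaced token w(i); the names `seq_len`, `replace_pos`, and `attn_mask` are illustrative assumptions. The comments mark where the indirect leakage through other positions would come from.

```python
# Hypothetical sketch, not the repo's implementation: a self-attention mask
# where, for a "replace" position i, row i masks out column i so position i
# never attends to its own (replaced) token w(i).
import torch

seq_len = 5
replace_pos = 2  # assume the token at position 2 was replaced

# attn_mask[q, k] == True means query position q may attend to key position k
attn_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
attn_mask[replace_pos, replace_pos] = False  # position i cannot see w(i) directly

# Note: positions j != i still attend to w(i) at layer 1, so their layer-1
# outputs carry information about w(i). At layer 2, position i attends to
# those positions j, so information about w(i) can flow back into position i
# indirectly -- which is the leakage this question is asking about.
print(attn_mask)
```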