So, after checking that your monkey patch does not use the `attention_mask` parameter in `forward`, I learned that there is a class named `LowerTriangularMaskWithTensorBias` that we could pass the `attention_mask` into.
I also enabled `use_cache` because the script still uses `past_key_value` and even concatenates it to the current key and value tensors.
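
Roughly, here is a minimal sketch of what I mean. The function name `attention_with_cache` and the tensor shapes are my own assumptions for illustration, not the actual patch; the idea is just to wrap the additive `attention_mask` in `LowerTriangularMaskWithTensorBias` before calling `memory_efficient_attention`, and to keep the `past_key_value` concatenation when `use_cache` is on:

```python
import torch
import xformers.ops as xops
from xformers.ops.fmha.attn_bias import LowerTriangularMaskWithTensorBias


def attention_with_cache(query, key, value, attention_mask=None,
                         past_key_value=None, use_cache=True):
    # query/key/value: (batch, seq_len, num_heads, head_dim), HF LLaMA-style layout (assumption)
    if past_key_value is not None:
        # Concatenate the cached keys/values to the current ones along the sequence dim.
        past_key, past_value = past_key_value
        key = torch.cat([past_key, key], dim=1)
        value = torch.cat([past_value, value], dim=1)
    present = (key, value) if use_cache else None

    if attention_mask is not None:
        # attention_mask is assumed to already be an additive bias of shape
        # (batch, 1, q_len, kv_len); broadcast it over heads and wrap it so the
        # causal mask is still applied on top of the bias.
        num_heads = query.shape[2]
        bias = attention_mask.expand(-1, num_heads, -1, -1)
        attn_bias = LowerTriangularMaskWithTensorBias(bias)
    else:
        attn_bias = xops.LowerTriangularMask()

    out = xops.memory_efficient_attention(query, key, value, attn_bias=attn_bias)
    return out, present
```

Depending on the xFormers version, the bias tensor may also need to be made contiguous or padded to satisfy the kernel's stride requirements, so treat the above only as a starting point.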