Closed · tang-ed closed this issue 11 months ago
Could you please clarify what is meant by pad ID here? Is it the token ID used to pad the sentence?
There's no need for an attention mask in this case, as the architecture enforces causality internally (please see the internal usage of the D matrix in retention.py).
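To illustrate the point about built-in causality: in the RetNet paper, the decay matrix D is lower triangular, with D[n, m] = gamma^(n-m) for n >= m and 0 otherwise, so position n can only be influenced by positions m <= n. A minimal sketch (the function name `decay_matrix` and the parameter values are mine for illustration, not from retention.py):

```python
import torch

def decay_matrix(seq_len: int, gamma: float) -> torch.Tensor:
    # D[n, m] = gamma**(n - m) for n >= m, else 0.
    # The zero entries above the diagonal are what enforce
    # causality: later positions cannot affect earlier ones.
    n = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    m = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    return (gamma ** (n - m).float()) * (n >= m).float()

D = decay_matrix(4, gamma=0.9)
# D is lower triangular, e.g. D[2, 0] == 0.9**2 == 0.81
# and D[0, 1] == 0.0.
```

Because the matrix is lower triangular, a pad token appended at the *end* of a sequence cannot change the retention outputs at earlier positions; pads only matter if they appear before real tokens, or when pooling/loss computation reads the padded positions.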
Hello, I have reviewed some of the code, and it does not use an attention mask. This is RetNet. Don't you need to mask out the pad ID? Or does the pad ID have no impact on the preceding tokens in the sequence?