Jamie-Stirling / RetNet

An implementation of "Retentive Network: A Successor to Transformer for Large Language Models"
MIT License

Some Questions about Attention Mask #2

Closed tang-ed closed 11 months ago

tang-ed commented 11 months ago

Hello, I have reviewed some of the code and noticed that RetNet does not use an attention mask. Don't you need to mask out the pad IDs? Or do the pad IDs have no impact on the rest of the sequence?

Jamie-Stirling commented 11 months ago

Please could you clarify what is meant by pad ID here?

tang-ed commented 11 months ago

I mean the ID used to pad the sentence (the padding token).

Jamie-Stirling commented 11 months ago

There's no need for an attention mask in this case, as the architecture enforces causality internally (please see the internal usage of the D matrix in retention.py).
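For illustration, here is a minimal sketch of such a lower-triangular decay matrix, following the definition in the RetNet paper (D[n, m] = gamma^(n-m) for n >= m, else 0). The function name and the `gamma` value are placeholders, not the repo's actual code in retention.py:

```python
import torch

def causal_decay_matrix(seq_len: int, gamma: float = 0.96) -> torch.Tensor:
    """Build a decay matrix D with D[n, m] = gamma ** (n - m) for n >= m, else 0.

    Future positions (m > n) are zeroed, so causality holds without a
    separate attention mask.
    """
    idx = torch.arange(seq_len)
    n, m = idx.unsqueeze(1), idx.unsqueeze(0)
    decay = gamma ** (n - m).clamp(min=0).float()  # gamma^(n-m) on/below the diagonal, 1 above
    return torch.tril(decay)                       # zero out entries above the diagonal

# Illustrative use in the parallel retention form (shapes only, not the repo's code):
# scores = Q @ K.transpose(-1, -2)                       # (L, L) similarity scores
# out = (scores * causal_decay_matrix(Q.size(-2))) @ V   # causality enforced by D
```

Because every entry of D above the diagonal is zero, a position can only retain information from earlier positions; padding tokens appended after the end of a sequence therefore cannot affect the outputs for the real tokens that precede them.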