Jamie-Stirling / RetNet

An implementation of "Retentive Network: A Successor to Transformer for Large Language Models"
MIT License

A question about xPos and the decay matrix D #35

Open DavideHe opened 7 months ago

DavideHe commented 7 months ago

As we all know, xPos already has a built-in decay, but you also apply the decay matrix D after Q @ K^T. Is that redundant?

Chandler-Q commented 6 months ago

I think your concern is justified. xPos already provides decay, so the additional decay from matrix D may be redundant.

Roschach-02 commented 5 months ago

It seems that xPos is not applied directly in retention; instead it is split into two parts, the rotation Θ and the decay matrix D.
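
The split can be checked numerically. Below is a minimal Python sketch, assuming the scalar-decay form from the RetNet paper, where D[n][m] = γ^(n−m) for n ≥ m and 0 otherwise. It ignores the rotation Θ entirely, and the function names are made up for illustration rather than taken from this repo. It compares applying D after Q @ K^T with folding the same decay into Q and K the way xPos-style scaling would.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decay_matrix(seq_len, gamma):
    """Causal decay matrix: D[n][m] = gamma**(n - m) if n >= m, else 0."""
    return [[gamma ** (n - m) if n >= m else 0.0 for m in range(seq_len)]
            for n in range(seq_len)]

def scores_decay_after(q, k, gamma):
    """Compute (Q @ K^T) multiplied elementwise by D."""
    d = decay_matrix(len(q), gamma)
    return [[dot(q[n], k[m]) * d[n][m] for m in range(len(k))]
            for n in range(len(q))]

def scores_decay_in_qk(q, k, gamma):
    """Fold the decay into Q and K, then apply a plain causal mask.

    Row n of Q is scaled by gamma**n and row m of K by gamma**(-m),
    so their dot product picks up the factor gamma**(n - m).
    """
    q_scaled = [[x * gamma ** n for x in row] for n, row in enumerate(q)]
    k_scaled = [[x * gamma ** (-m) for x in row] for m, row in enumerate(k)]
    return [[dot(q_scaled[n], k_scaled[m]) if n >= m else 0.0
             for m in range(len(k))]
            for n in range(len(q))]

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy inputs (hypothetical values): 3 positions, head dimension 2.
q = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gamma = 0.9

after = scores_decay_after(q, k, gamma)
folded = scores_decay_in_qk(q, k, gamma)
print("max difference:", max_abs_diff(after, folded))
```

The two routes agree up to floating-point error, so the decay only needs to live in one place. If Q and K already carried this scaling and D were applied on top, the scores would be decayed twice. In the paper's formulation, Q and K carry only the rotation and all of the decay sits in D. Whether this repo's xPos module also applies its own scaling is a separate question that this sketch does not answer.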