Closed achen46 closed 1 year ago
@achen46 You're correct -- RetNet does not provide any benefits for non-autoregressive tasks. The primary innovation is the dual parallel/recurrent formulation. If you don't need recurrent (autoregressive) updates, the parallel formulation is essentially just a Transformer variant.
P.S. Sorry for the late response -- have been extremely busy the past few weeks. 🙏
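To make the dual formulation above concrete, here is a minimal NumPy sketch of single-head retention in both forms. It is an illustration under simplifying assumptions, not the code from this repo: it omits the positional rotation, scaling, and normalization used in the full model, and the function names `retention_parallel` and `retention_recurrent` are made up for this example.

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Parallel form: (Q K^T * D) V, with D[n, m] = gamma**(n - m) for n >= m, else 0."""
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    # Causal decay mask; np.maximum keeps the exponent non-negative above the diagonal.
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * decay) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_t = gamma * S_{t-1} + k_t^T v_t, then o_t = q_t S_t."""
    state = np.zeros((k.shape[1], v.shape[1]))
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = gamma * state + np.outer(k_t, v_t)  # constant-size state per step
        outputs.append(q_t @ state)
    return np.stack(outputs)

# Random toy inputs: sequence length 6, head dimension 4 (arbitrary choices).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
out_parallel = retention_parallel(q, k, v, gamma=0.9)
out_recurrent = retention_recurrent(q, k, v, gamma=0.9)
```

Both functions compute the same outputs. The recurrent form only pays off when tokens arrive one at a time, since it carries a fixed-size state instead of re-reading the whole sequence; when the full sequence is available up front, the parallel form is the natural choice and behaves much like masked attention without the softmax.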
Thank you for your work. I did some testing with your implementation, and it is robust and works pretty well!
However, for non-autoregressive applications, the throughput is quite a bit worse than that of regular Transformers. In essence, the same parallel formulation can be used to generate a token (or a representation) by feeding in the entire token sequence at once, without having to keep the states and loop through them.
In that case, what makes RetNet a successor to the Transformer?