How's this RetNet useful when throughput is actually lower?

fkodom / yet-another-retnet

A simple but robust PyTorch implementation of RetNet from "Retentive Network: A Successor to Transformer for Large Language Models" (https://arxiv.org/pdf/2307.08621.pdf)

MIT License


How's this RetNet useful when throughput is actually lower? #8

Status: Closed (achen46 closed this issue 1 year ago)

achen46 commented 1 year ago

Thank you for your work. I did some testing with your implementation, and it is robust and works well!

However, for non-autoregressive applications, the throughput is noticeably lower than that of regular Transformers. In essence, the same parallel formulation can be used to generate a token (or a representation) by feeding in the entire token sequence at once, without having to keep states and loop through them.

In that case, what makes RetNet a successor to the Transformer?

fkodom commented 1 year ago

@achen46 You're correct -- RetNet does not provide any benefits for non-autoregressive tasks. The primary innovation is the dual parallel/recurrent formulation. If you don't need recurrent (autoregressive) updates, the parallel formulation is essentially just a Transformer variant.
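To make the dual formulation concrete, here is a minimal sketch in plain Python. It is not this repository's API. It assumes a single head with a scalar decay `gamma`, and it leaves out the rotation, scaling, and group normalization used in the paper. The function names `retention_parallel` and `retention_recurrent` are hypothetical. Both functions compute the same outputs: the parallel form processes the whole sequence at once, while the recurrent form carries a fixed-size state from token to token.

```python
import random


def retention_parallel(q, k, v, gamma):
    """Parallel form: O = (Q K^T * D) V, with D[n][m] = gamma**(n - m) for n >= m."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal mask: only positions m <= n contribute
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * v[m][j]
        out.append(row)
    return out


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed size, independent of sequence length
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Compare the two forms on random inputs (5 tokens, key dim 3, value dim 2).
random.seed(0)
q = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
k = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
v = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(5)]

parallel_out = retention_parallel(q, k, v, gamma=0.9)
recurrent_out = retention_recurrent(q, k, v, gamma=0.9)
max_diff = max(
    abs(a - b)
    for row_p, row_r in zip(parallel_out, recurrent_out)
    for a, b in zip(row_p, row_r)
)
```

In this sketch the recurrent form does a constant amount of work per new token, which is the case that matters for autoregressive decoding. When the full sequence is available up front, only the parallel form is needed, and it is quadratic in sequence length, much like attention.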

fkodom commented 1 year ago

P.S. Sorry for the late response -- have been extremely busy the past few weeks. 🙏