In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory usage without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. These intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
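To make the three paradigms concrete, below is a minimal NumPy sketch (not the released implementation) of a single retention head with a single decay rate: the parallel form used for training, the O(1)-state recurrent form used for decoding, and the chunkwise recurrent form that mixes the two. The decay rate `gamma`, chunk size `B`, and tensor shapes are illustrative assumptions, and the xpos-style rotation, multi-scale decays, gating, and group normalization of the full RetNet layer are omitted.

```python
# Minimal sketch of the three retention paradigms (single head, single decay).
# Assumed toy values for T, d, B, gamma; not the released implementation.
import numpy as np

T, d, B = 8, 4, 4          # sequence length, head dim, chunk size (T % B == 0)
gamma = 0.9                # per-head decay rate (assumed value)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# --- Parallel form: all positions computed at once (training parallelism). ---
n, m = np.arange(T)[:, None], np.arange(T)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)        # causal decay mask
out_parallel = (Q @ K.T * D) @ V

# --- Recurrent form: a single d x d state per step (O(1) inference). ---
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])           # update decayed state
    out_recurrent[t] = Q[t] @ S

# --- Chunkwise recurrent form: parallel inside each chunk, recurrent across chunks. ---
nb, mb = np.arange(B)[:, None], np.arange(B)[None, :]
Db = np.where(nb >= mb, gamma ** (nb - mb), 0.0)   # intra-chunk decay mask
xi = gamma ** (np.arange(B) + 1)[:, None]          # decay from the previous chunk state
zeta = gamma ** (B - np.arange(B) - 1)[:, None]    # decay used when summarizing a chunk
R = np.zeros((d, d))                                # cross-chunk state
out_chunkwise = np.zeros((T, d))
for c in range(T // B):
    q, k, v = Q[c*B:(c+1)*B], K[c*B:(c+1)*B], V[c*B:(c+1)*B]
    inner = (q @ k.T * Db) @ v                      # parallel within the chunk
    cross = (q * xi) @ R                            # contribution of earlier chunks
    out_chunkwise[c*B:(c+1)*B] = inner + cross
    R = k.T @ (v * zeta) + (gamma ** B) * R         # recurrently summarize the chunk

assert np.allclose(out_parallel, out_recurrent)
assert np.allclose(out_parallel, out_chunkwise)
```

The three computations yield the same outputs (up to floating-point error), which is what allows RetNet to train with full parallelism yet decode with a constant-size state and handle long sequences chunk by chunk with linear complexity.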