In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory usage without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. These intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
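To make the three paradigms concrete, below is a minimal NumPy sketch (not the released implementation) of a single retention head with a single decay rate: the parallel form used for training, the O(1)-state recurrent form used for decoding, and the chunkwise recurrent form that mixes the two. The decay rate `gamma`, chunk size `B`, and tensor shapes are illustrative assumptions, and the xpos-style rotation, multi-scale decays, gating, and group normalization of the full RetNet layer are omitted.

```python
# Minimal sketch of the three retention paradigms (single head, single decay).
# Assumed toy values for T, d, B, gamma; not the released implementation.
import numpy as np

T, d, B = 8, 4, 4          # sequence length, head dim, chunk size (T % B == 0)
gamma = 0.9                # per-head decay rate (assumed value)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# --- Parallel form: all positions computed at once (training parallelism). ---
n, m = np.arange(T)[:, None], np.arange(T)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)        # causal decay mask
out_parallel = (Q @ K.T * D) @ V

# --- Recurrent form: a single d x d state per step (O(1) inference). ---
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])           # update decayed state
    out_recurrent[t] = Q[t] @ S

# --- Chunkwise recurrent form: parallel inside each chunk, recurrent across chunks. ---
nb, mb = np.arange(B)[:, None], np.arange(B)[None, :]
Db = np.where(nb >= mb, gamma ** (nb - mb), 0.0)   # intra-chunk decay mask
xi = gamma ** (np.arange(B) + 1)[:, None]          # decay from the previous chunk state
zeta = gamma ** (B - np.arange(B) - 1)[:, None]    # decay used when summarizing a chunk
R = np.zeros((d, d))                                # cross-chunk state
out_chunkwise = np.zeros((T, d))
for c in range(T // B):
    q, k, v = Q[c*B:(c+1)*B], K[c*B:(c+1)*B], V[c*B:(c+1)*B]
    inner = (q @ k.T * Db) @ v                      # parallel within the chunk
    cross = (q * xi) @ R                            # contribution of earlier chunks
    out_chunkwise[c*B:(c+1)*B] = inner + cross
    R = k.T @ (v * zeta) + (gamma ** B) * R         # recurrently summarize the chunk

assert np.allclose(out_parallel, out_recurrent)
assert np.allclose(out_parallel, out_chunkwise)
```

The three computations yield the same outputs (up to floating-point error), which is what allows RetNet to train with full parallelism yet decode with a constant-size state and handle long sequences chunk by chunk with linear complexity.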