URL

https://arxiv.org/abs/2305.13048
Affiliations
- Bo Peng, N/A
- Eric Alcaide, N/A
- Quentin Anthony, N/A
- Alon Albalak, N/A
- Samuel Arcadinho, N/A
- Huanqi Cao, N/A
- Xin Cheng, N/A
- Michael Chung, N/A
- Matteo Grella, N/A
- Kranthi Kiran GV, N/A
- Xuzheng He, N/A
- Haowen Hou, N/A
- Przemyslaw Kazienko, N/A
- Jan Kocon, N/A
- Jiaming Kong, N/A
- Bartlomiej Koptyra, N/A
- Hayden Lau, N/A
- Krishna Sri Ipsit Mantri, N/A
- Ferdinand Mom, N/A
- Atsushi Saito, N/A
- Xiangru Tang, N/A
- Bolun Wang, N/A
- Johan S. Wind, N/A
- Stanislaw Wozniak, N/A
- Ruichong Zhang, N/A
- Zhenyuan Zhang, N/A
- Qihang Zhao, N/A
- Peng Zhou, N/A
- Jian Zhu, N/A
- Rui-Jie Zhu, N/A
Abstract
- Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. Our experiments reveal that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks.
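Translation (by gpt-3.5-turbo)
Transformers have revolutionized almost all natural language processing (NLP) tasks, but they suffer from memory and computational complexity that scales quadratically with sequence length. Recurrent neural networks (RNNs), by contrast, scale linearly in memory and computational requirements, but limitations in parallelization and scalability make it hard for them to match the performance of Transformers. This work proposes a new model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. The approach uses a linear attention mechanism and lets the model be formulated as either a Transformer or an RNN, so computation is parallelized during training while computational and memory complexity stay constant during inference. This makes RWKV the first non-Transformer architecture to be scaled to tens of billions of parameters. Experiments show that RWKV performs on par with Transformers of the same size, suggesting that future work can use this architecture to build more efficient models. The work is an important step toward reconciling the trade-off between computational efficiency and model performance in sequence processing tasks.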
Summary (by gpt-3.5-turbo)
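This work proposes RWKV, a new model architecture that combines the advantages of Transformers and RNNs: computation is parallelized during training, and computational and memory complexity stay constant during inference. RWKV performs on par with Transformers of the same size, suggesting that the architecture can be used to build more efficient models in the future.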
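
The claim that one model can be written either as a Transformer or as an RNN is easier to see in a toy example. The sketch below is not taken from these notes. It assumes the per-channel WKV form described in the RWKV paper, in which earlier tokens are weighted by an exponential decay `w` and the current token gets a separate bonus `u`, and it uses a single scalar channel. The function names are hypothetical, and the numerical stabilization a practical implementation would need is left out.

```python
import math


def wkv_full_history(w, u, k, v):
    """Attention-like view: each output is a weighted average over the whole prefix.

    Every position depends only on the inputs, so all positions could be
    computed independently, which is what allows parallel training.
    """
    out = []
    for t in range(len(k)):
        # The current token is weighted with the bonus u instead of the decay.
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            # Earlier tokens are decayed by w for each step of distance.
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(w, u, k, v):
    """RNN view: the same outputs from a constant-size running state."""
    a = 0.0  # decayed running sum of exp(k_i) * v_i over past tokens
    b = 0.0  # decayed running sum of exp(k_i) over past tokens
    decay = math.exp(-w)
    out = []
    for kt, vt in zip(k, v):
        bonus = math.exp(u + kt)
        out.append((a + bonus * vt) / (b + bonus))
        # Fold the current token into the state for the next step.
        a = decay * a + math.exp(kt) * vt
        b = decay * b + math.exp(kt)
    return out


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.2]
    values = [1.0, 2.0, -1.0, 0.5]
    full = wkv_full_history(0.5, 0.3, keys, values)
    step = wkv_recurrent(0.5, 0.3, keys, values)
    # The two views should agree up to floating-point rounding.
    print(max(abs(x - y) for x, y in zip(full, step)))
```

The recurrent version keeps only two numbers per channel regardless of sequence length, which is the sense in which inference cost per token stays constant, while the full-history version has no sequential dependency between positions.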

RWKV: Reinventing RNNs for the Transformer Era, Bo Peng+, N/A, arXiv'23 #765