AkihikoWatanabe commented 2 days ago

URL

https://arxiv.org/abs/2406.15786
Affiliations
- Shwai He, N/A
- Guoheng Sun, N/A
- Zheyu Shen, N/A
- Ang Li, N/A
Abstract
- While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. Additionally, we further propose a method that jointly drops Attention and MLP layers, allowing us to more aggressively drop additional layers. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.
Summary (by gpt-4o-mini)
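This study investigates redundancy among the Blocks, MLP layers, and Attention layers in transformers and shows that attention layers are similar enough to be pruned. Specifically, removing half of the attention layers in Llama-2-70B yields a 48.4% speedup with only a 2.4% performance drop. The authors also propose a method that drops Attention and MLP layers jointly; even with 31 layers dropped, Llama-2-13B retains 90% of its performance. These results provide valuable insights for future network architecture design.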

AkihikoWatanabe commented 2 days ago

LLMs are usually built by stacking transformer decoder blocks, but are all of those stacked blocks, or layers, really necessary? This paper appears to answer that question.

Removing whole transformer blocks or MLP layers degrades performance substantially, whereas removing attention layers apparently causes no performance drop. This makes a speedup achievable.

The blocks or layers to remove are chosen as those whose input and output have high cosine similarity.
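The selection rule can be illustrated with a minimal sketch. It assumes that, for each candidate layer, the hidden states entering and leaving it have already been collected on some calibration data; the layer names, the toy vectors, and the helper names (`layer_similarity`, `select_layers_to_drop`) are hypothetical and are not taken from the LLM-Drop code. Averaging the per-token similarity is also an assumption of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layer_similarity(inputs, outputs):
    """Mean input/output cosine similarity over token positions (assumed aggregation)."""
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return sum(sims) / len(sims)


def select_layers_to_drop(layer_io, num_drop):
    """Rank layers by input/output similarity and return the most similar ones.

    layer_io maps a (hypothetical) layer name to a pair
    (input hidden states, output hidden states), one vector per token.
    """
    scores = {name: layer_similarity(x, y) for name, (x, y) in layer_io.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_drop], scores


# Toy hidden states: two tokens, two dimensions per layer (illustrative only).
layer_io = {
    "attn_0": ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]),  # output orthogonal to input
    "attn_1": ([[1.0, 2.0], [3.0, 1.0]], [[1.1, 2.1], [3.0, 1.2]]),  # output barely changes
    "attn_2": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),  # moderate change
}

dropped, scores = select_layers_to_drop(layer_io, num_drop=1)
print(dropped)
```

In this toy data `attn_1` barely changes its input, so it is the one selected for removal. In a real model the hidden states would come from forward passes over calibration text, and the number of layers to drop would be tuned against downstream accuracy.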

Results for the relatively small 7B and 13B models

Results for larger models

AkihikoWatanabe commented 2 days ago

Within the range where performance stays unchanged, dropping attention layers improves throughput by about 23% for the 7B and 13B models and by 35% for the 70B model.

AkihikoWatanabe / paper_notes

What Matters in Transformers? Not All Attention is Needed, Shwai He+, N/A, arXiv'24 #1467
