URL

https://arxiv.org/abs/2411.04996
Authors
- Weixin Liang
- Lili Yu
- Liang Luo
- Srinivasan Iyer
- Ning Dong
- Chunting Zhou
- Gargi Ghosh
- Mike Lewis
- Wen-tau Yih
- Luke Zettlemoyer
- Xi Victoria Lin
Abstract
- The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
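
The decoupling described in the abstract can be illustrated with a toy block: each token picks its layer norms, attention projections, and feed-forward weights by modality, while attention itself runs over every token in the sequence. This is only a minimal sketch, not the authors' implementation. The single attention head, pre-norm layout, ReLU feed-forward, omission of causal masking, and every name here (`make_modality_params`, `mot_block`, the parameter keys) are assumptions made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def layer_norm(x, gain, eps=1e-5):
    """Normalize x to zero mean and unit variance, then scale by gain."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) for g, v in zip(gain, x)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_modality_params(d_model, d_ff, rng):
    """One full set of non-embedding parameters, owned by a single modality."""
    return {
        "ln_attn": [1.0] * d_model,
        "ln_ffn": [1.0] * d_model,
        "wq": rand_matrix(d_model, d_model, rng),
        "wk": rand_matrix(d_model, d_model, rng),
        "wv": rand_matrix(d_model, d_model, rng),
        "wo": rand_matrix(d_model, d_model, rng),
        "w_in": rand_matrix(d_ff, d_model, rng),
        "w_out": rand_matrix(d_model, d_ff, rng),
    }


def mot_block(tokens, modalities, params):
    """Apply one toy MoT-style block to a mixed-modality sequence.

    tokens: list of d_model-sized vectors; modalities: one name per token.
    Causal masking is omitted for brevity (an assumption of this sketch).
    """
    d_model = len(tokens[0])
    queries, keys, values = [], [], []
    # Modality-specific layer norm and Q/K/V projections.
    for x, m in zip(tokens, modalities):
        p = params[m]
        h = layer_norm(x, p["ln_attn"])
        queries.append(matvec(p["wq"], h))
        keys.append(matvec(p["wk"], h))
        values.append(matvec(p["wv"], h))

    outputs = []
    for x, m, q in zip(tokens, modalities, queries):
        p = params[m]
        # Global self-attention: every token attends to all tokens,
        # regardless of which modality produced their keys and values.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_model) for k in keys]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_model)]
        # Modality-specific output projection, then residual connection.
        h = [a + b for a, b in zip(x, matvec(p["wo"], context))]
        # Modality-specific feed-forward network with its own layer norm.
        hidden = [max(0.0, u) for u in matvec(p["w_in"], layer_norm(h, p["ln_ffn"]))]
        outputs.append([a + b for a, b in zip(h, matvec(p["w_out"], hidden))])
    return outputs


rng = random.Random(0)
params = {name: make_modality_params(4, 8, rng) for name in ("text", "image")}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
modalities = ["text", "text", "image", "image", "text"]
outputs = mot_block(tokens, modalities, params)
```

In this sketch each token only ever touches one modality's weights, so per-token compute is the same as in a dense block of the same width, while the total parameter count grows with the number of modalities. The only cross-modal interaction is the shared attention over `keys` and `values`.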
Summary (by gpt-4o-mini)
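Proposes Mixture-of-Transformers (MoT) to make multi-modal processing in large language models (LLMs) more efficient. MoT reduces pretraining compute by decoupling parameters by modality, which enables modality-specific processing. In the Chameleon 7B setting, it matches the dense baseline with 55.8% of the FLOPs, and when speech is added it reaches comparable speech performance with 37.2% of the FLOPs. In the Transfusion setting, a 7B MoT model matches the dense baseline's image performance with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline on key image generation metrics. MoT also shows practical benefits, reaching the dense baseline's image quality in 47.2% of the wall-clock time and its text quality in 75.6% of the wall-clock time.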
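
The percentages above are fractions of the dense baseline's cost, so their reciprocals give a rough "times cheaper" reading. The snippet below does only that arithmetic. The reduction factors are derived here and are not figures quoted in the abstract, and the dictionary labels are made up for this note.

```python
# Fractions of dense-baseline cost, taken from the abstract.
cost_fraction = {
    "Chameleon 7B, text+image (FLOPs)": 0.558,
    "Chameleon with speech, speech quality (FLOPs)": 0.372,
    "Transfusion 7B, image quality (FLOPs)": 1.0 / 3.0,
    "Image quality (wall-clock time)": 0.472,
    "Text quality (wall-clock time)": 0.756,
}


def reduction_factor(fraction):
    """Reciprocal of a cost fraction: how many times cheaper than the baseline."""
    return 1.0 / fraction


# Derived values, rounded to two decimals for readability.
factors = {name: round(reduction_factor(f), 2) for name, f in cost_fraction.items()}

for name, factor in factors.items():
    print(f"{name}: about {factor:.2f}x less than the dense baseline")
```

Read this way, the wall-clock gains are smaller than the FLOP gains for text, which is consistent with the abstract reporting the two measurements separately.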


Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models, Weixin Liang+, arXiv'24 #1505
