We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
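The routing described above (a per-token, per-layer gate that picks 2 of 8 expert feedforward blocks and mixes their outputs) can be illustrated with a minimal sketch. This is not the reference implementation: the class name `Top2MoELayer`, the dimensions, and the plain two-layer MLP standing in for each expert block are illustrative assumptions, and the dense loop over experts is written for clarity rather than the sparse dispatch a real implementation would use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a sparse MoE feedforward layer: a router picks 2 of 8 experts per token."""

    def __init__(self, dim=4096, hidden_dim=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router network: produces one logit per expert for each token.
        self.gate = nn.Linear(dim, num_experts, bias=False)
        # Illustrative experts: simple two-layer MLPs (the actual expert block differs).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, dim); routing decisions are made independently per token.
        logits = self.gate(x)                                   # (tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route a batch of token states through the sparse layer.
tokens = torch.randn(16, 4096)
layer = Top2MoELayer()
print(layer(tokens).shape)  # torch.Size([16, 4096])
```

Because only the 2 selected experts run per token, a forward pass touches a fraction of the layer's parameters, which is how each token can draw on 47B total parameters while using only about 13B active ones.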