We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
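The routing described above (a per-token, per-layer gate that picks 2 of 8 expert feedforward blocks and mixes their outputs) can be illustrated with a minimal sketch. This is not the reference implementation: the class name `Top2MoELayer`, the dimensions, and the plain two-layer MLP standing in for each expert block are illustrative assumptions, and the dense loop over experts is written for clarity rather than the sparse dispatch a real implementation would use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a sparse MoE feedforward layer: a router picks 2 of 8 experts per token."""

    def __init__(self, dim=4096, hidden_dim=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router network: produces one logit per expert for each token.
        self.gate = nn.Linear(dim, num_experts, bias=False)
        # Illustrative experts: simple two-layer MLPs (the actual expert block differs).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, dim); routing decisions are made independently per token.
        logits = self.gate(x)                                   # (tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route a batch of token states through the sparse layer.
tokens = torch.randn(16, 4096)
layer = Top2MoELayer()
print(layer(tokens).shape)  # torch.Size([16, 4096])
```

Because only the 2 selected experts run per token, a forward pass touches a fraction of the layer's parameters, which is how each token can draw on 47B total parameters while using only about 13B active ones.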