AkihikoWatanabe commented 4 months ago

URL

https://arxiv.org/pdf/2310.06825
Affiliations
- Albert Q. Jiang, N/A
- Alexandre Sablayrolles, N/A
- Arthur Mensch, N/A
- Chris Bamford, N/A
- Devendra Singh Chaplot, N/A
- Diego de las Casas, N/A
- Florian Bressand, N/A
- Gianna Lengyel, N/A
- Guillaume Lample, N/A
- Lucile Saulnier, N/A
- Lélio Renard Lavaud, N/A
- Marie-Anne Lachaux, N/A
- Pierre Stock, N/A
- Teven Le Scao, N/A
- Thibaut Lavril, N/A
- Thomas Wang, N/A
- Timothée Lacroix, N/A
- William El Sayed, N/A
Abstract
- We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
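Translation (by gpt-3.5-turbo)
Mistral 7B v0.1 is a 7-billion-parameter language model designed for strong performance and efficiency. Mistral 7B outperforms Llama 2 13B on all evaluated benchmarks and surpasses Llama 1 34B in reasoning, mathematics, and code generation. The model uses grouped-query attention (GQA) for fast inference, combined with sliding window attention (SWA) to handle sequences of arbitrary length effectively while reducing inference cost. An instruction-tuned model, Mistral 7B -- Instruct, is also provided, and it outperforms the Llama 2 13B -- Chat model on both human and automated benchmarks. The models are released under the Apache 2.0 license.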
Summary (by gpt-3.5-turbo)
Mistral 7B v0.1 is a 7-billion-parameter language model that uses GQA for fast inference, combined with SWA. In addition, Mistral 7B -- Instruct outperforms the Llama 2 13B -- Chat model, and the models are released under the Apache 2.0 license.

AkihikoWatanabe commented 4 months ago

See also related models such as #1237 and #1279.

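Motivation: as models scale up, inference latency grows and the compute cost becomes too high to be practical, so the goal is fast inference with a small parameter count. To that end, the model uses sliding window attention (SWA) and grouped-query attention (GQA, #1271).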
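To make the two mechanisms concrete, here is a minimal pure-Python sketch, not the authors' implementation. It shows which key positions a query may attend to under a causal sliding window, and how query heads share key/value heads under GQA. The function names, the tiny sizes in the usage part, and the convention that the window includes the current token are all assumptions made for illustration; real implementations may differ by one in how they count the window.

```python
def sliding_window_mask(seq_len, window):
    """Return a seq_len x seq_len boolean matrix (illustrative helper).

    mask[i][j] is True when query position i may attend to key position j:
    j must not be in the future (j <= i) and must lie within the most
    recent `window` positions, counting position i itself (i - j < window).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Map a query head index to the key/value head it shares under GQA.

    Query heads are split into n_kv_heads contiguous groups; every query
    head in a group reads the same key/value head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


if __name__ == "__main__":
    # Hypothetical toy sizes, chosen only so the output fits on screen.
    for row in sliding_window_mask(seq_len=5, window=3):
        print("".join("x" if allowed else "." for allowed in row))
    print([kv_head_for_query_head(h, n_q_heads=8, n_kv_heads=2) for h in range(8)])
```

Each row of the mask has at most `window` allowed positions, so per-token attention cost stops growing once the sequence is longer than the window. With GQA, fewer key/value heads than query heads means a smaller key/value cache to keep in memory during decoding.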

With fewer parameters, it outperforms Llama 2 on a variety of tasks, and the instruction-tuned model achieved a higher Elo rating on Chatbot Arena than 13B models.

AkihikoWatanabe commented 4 months ago

The context length is 8192 tokens.
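With sliding window attention, the key/value cache does not have to grow with the full sequence; the paper describes a rolling buffer cache for this. The sketch below is a simplified illustration under assumptions, not the released code: the class name `RollingKVCache`, its methods, and the toy window size are hypothetical, and the actual window size is not stated in this note.

```python
class RollingKVCache:
    """Fixed-size cache that keeps only the most recent `window` entries."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.next_pos = 0  # absolute position of the next token to store

    def append(self, entry):
        # Position p is written to slot p % window, overwriting the entry
        # that has just fallen out of the attention window.
        self.slots[self.next_pos % self.window] = entry
        self.next_pos += 1

    def contents(self):
        """Return the cached entries ordered from oldest to newest."""
        start = max(0, self.next_pos - self.window)
        return [self.slots[p % self.window] for p in range(start, self.next_pos)]


if __name__ == "__main__":
    cache = RollingKVCache(window=4)  # toy window size for illustration
    for position in range(10):
        cache.append(position)  # a real cache would store key/value tensors
    print(cache.contents())
```

Memory use stays bounded by the window size regardless of how many tokens have been processed, which is where the reduced inference cost for long sequences comes from.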

AkihikoWatanabe / paper_notes

Mistral 7B, Albert Q. Jiang+, N/A, arXiv'23 #1309
