URL

https://arxiv.org/abs/2404.05405
Affiliations
- Zeyuan Allen-Zhu, N/A
- Yuanzhi Li, N/A
Abstract
- Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include: The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train. Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.
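Translation (by gpt-3.5-turbo)
Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies, which evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.), taken from a Wikipedia page. Through multiple controlled datasets, we establish that language models can store 2 bits of knowledge per parameter, and that such knowledge can be flexibly extracted even when the model is quantized to int8. As a result, a 7B model can store 14B bits of knowledge, which by our estimate exceeds the English Wikipedia and textbooks combined. Furthermore, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include:
The GPT-2 architecture with rotary embedding matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This is because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.
Prepending domain names (e.g., wikipedia.org) to training data significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize knowledge-rich domains, optimizing their storage capacity.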
Summary (by gpt-3.5-turbo)
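A study of scaling laws, which describe the relationship between language model size and capability. It estimates the number of knowledge bits a model stores, representing factual knowledge as tuples. Language models can store 2 bits of knowledge per parameter, so a 7B model can store 14B bits. The study further suggests that training duration, model architecture, quantization, sparsity constraints, and data signal-to-noise ratio affect knowledge storage capacity. The GPT-2 architecture with rotary embedding can compete with LLaMA/Mistral architectures in knowledge storage, and adding domain names to training data was shown to increase knowledge capacity.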
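
The 2-bits-per-parameter figure converts into a capacity estimate by simple multiplication. The snippet below is a minimal sketch of that arithmetic only, assuming the ratio applies uniformly across model sizes; the function name and constant are hypothetical and do not come from the paper.

```python
# Illustrative arithmetic only; names are hypothetical, not from the paper.
BITS_PER_PARAM = 2.0  # capacity ratio reported in the abstract


def knowledge_capacity_bits(num_params: float,
                            bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimate storable knowledge in bits, assuming a fixed bits-per-parameter ratio."""
    if num_params < 0:
        raise ValueError("num_params must be non-negative")
    return num_params * bits_per_param


capacity = knowledge_capacity_bits(7e9)  # a 7B-parameter model
print(f"{capacity / 1e9:.0f}B bits")     # 14B bits, matching the abstract
print(f"{capacity / 8 / 1e9:.2f} GB")    # the same quantity expressed in gigabytes
```

For a 7B-parameter model this reproduces the 14B-bit figure from the abstract, which is 1.75 GB of stored knowledge under the same assumption.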

AkihikoWatanabe / paper_notes

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws, Zeyuan Allen-Zhu+, N/A, arXiv'24 #1286
