URL

https://arxiv.org/pdf/2411.04330
Authors
- Tanishq Kumar
- Zachary Ankner
- Benjamin F. Spector
- Blake Bordelon
- Niklas Muennighoff
- Mansheej Paul
- Cengiz Pehlevan
- Christopher Ré
- Aditi Raghunathan
Abstract
- Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
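Translation (by gpt-4o-mini)
Low-precision training and inference affect both the quality and the cost of language models, but current scaling laws do not take this into account. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in low precision reduces the model's "effective parameter count," which makes it possible to predict the additional loss arising from low-precision training and from post-training quantization. For inference, we find that the degradation caused by post-training quantization grows as the model is trained on more data, so that additional pretraining data eventually becomes harmful. For training, our scaling laws make it possible to predict the loss of a model whose parts are held in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post-training quantization and pretraining quantization into a single functional form that predicts the degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate the predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.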
Summary (by gpt-4o-mini)
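This work proposes "precision-aware" scaling laws that account for the effect of low-precision training and inference on language model quality. It shows that low-precision training reduces the effective parameter count, and that the degradation from post-training quantization worsens as the amount of training data grows. The laws predict model loss at different precisions and suggest that training larger models in lower precision may be compute optimal. The scaling laws are unified into a single form, and the predictions are validated experimentally.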
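
The two ideas in the summary (an "effective parameter count" that shrinks with training precision, and a post-training quantization penalty that grows with data) can be illustrated with a small sketch. This note does not give the functional forms or fitted constants, so everything below is an assumption: a Chinchilla-style loss with a saturating exponential for the effective parameter count, a power-law-times-exponential penalty for quantization, and placeholder constants chosen only so the numbers are readable. All function and parameter names are hypothetical.

```python
import math


def effective_params(n_params, train_bits, gamma=2.0):
    """Assumed saturating form: N_eff = N * (1 - exp(-P / gamma)).

    gamma is a placeholder sensitivity constant, not a fitted value.
    """
    return n_params * (1.0 - math.exp(-train_bits / gamma))


def pretrain_loss(n_params, n_tokens, train_bits,
                  A=400.0, alpha=0.34, B=410.0, beta=0.28, E=1.7):
    """Assumed Chinchilla-style loss evaluated at the effective parameter count.

    A, alpha, B, beta, E are illustrative placeholders.
    """
    n_eff = effective_params(n_params, train_bits)
    return A * n_eff ** -alpha + B * n_tokens ** -beta + E


def ptq_degradation(n_params, n_tokens, post_bits,
                    C=1e-3, gamma_d=0.5, gamma_n=0.5, gamma_post=1.0):
    """Assumed penalty added by post-training quantization.

    Grows with tokens, shrinks with parameters and with the number of
    bits kept after quantization. All constants are placeholders.
    """
    scale = n_tokens ** gamma_d / n_params ** gamma_n
    return C * scale * math.exp(-post_bits / gamma_post)


if __name__ == "__main__":
    n, d = 1e9, 2e10  # hypothetical model size and token count
    for bits in (4, 8, 16):
        print(bits, "bit training loss:", round(pretrain_loss(n, d, bits), 4))
    for tokens in (2e10, 2e11, 2e12):
        total = pretrain_loss(n, tokens, 16) + ptq_degradation(n, tokens, 4)
        print(f"{tokens:.0e} tokens, 4 bit PTQ total loss:", round(total, 4))
```

Under these assumptions, lowering `train_bits` raises the predicted loss because `effective_params` drops, and `ptq_degradation` rises with `n_tokens`, which is the mechanism by which more pretraining data can eventually hurt a model that will be quantized after training. The printed values carry no empirical meaning; reproducing the paper's predictions would require its fitted constants.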


Scaling Laws for Precision, Tanishq Kumar+, arXiv'24 #1512
