AkihikoWatanabe commented 2 hours ago

URL

https://arxiv.org/abs/2411.08719
Authors
- Kazuki Fujii
- Taishi Nakamura
- Rio Yokota
Abstract
- Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities, as well as their applicability across various domains. These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language processing. The Llama 3 series, for instance, exemplifies this trend with its flagship model boasting 405 billion parameters trained on 15.6 trillion tokens. The immense computational demands associated with training such models have spurred ongoing research into optimizing the efficiency of the training process, particularly through the use of lower-precision formats. NVIDIA's H100 GPU, which introduces support for FP8 in addition to the more conventional FP16 and BF16 formats, has emerged as a focal point in this optimization effort. Preliminary studies suggest that FP8 could offer substantial reductions in training time without sacrificing model performance when compared to BF16, making it a promising candidate for large-scale model training. However, the broader implications of adopting FP8, particularly in terms of training stability and downstream task performance, have yet to be fully understood. In this study, we delve into the practical trade-offs involved in adopting FP8 over BF16 for training LLMs.
Summary (by gpt-4o-mini)
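Large language models (LLMs) have attracted attention for their language understanding capabilities and broad applicability; the Llama 3 series in particular includes a model with 405 billion parameters. As demand grows for more efficient training, NVIDIA's H100 GPU introduces the FP8 format, which may shorten training time. Preliminary studies suggest that FP8 improves efficiency without sacrificing performance, but its effect on training stability and downstream tasks is still unclear. This study explores the trade-offs between BF16 and FP8 in LLM training.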

AkihikoWatanabe commented 2 hours ago

Original post: https://x.com/okoge_kaz/status/1857639065421754525?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q

AkihikoWatanabe commented 2 hours ago

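The paper appears to report that continued pre-training in FP8 improves throughput, but also produces loss spikes and lowers downstream task performance relative to BF16 (in both Japanese and English). At the moment only the abstract and the appendix are written; perhaps the rest of the content will be added in a later update.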

Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs, Kazuki Fujii+, arXiv'24 #1524
