Jordan Hoffmann★, Sebastian Borgeaud★, Arthur Mensch★, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre★ (★Equal contributions)
DeepMind
Abstract
We investigate the optimal model size and number of training tokens for a transformer language model under a given compute budget.
By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally.
Model size and the number of training tokens should be scaled up in equal proportion.
for every doubling of model size the number of training tokens should also be doubled
We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
So Chinchilla beat all of them on downstream tasks?
Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, a greater than 7% improvement over Gopher
Introduction
Issues that come up when training LLMs:
accurately estimating the best model hyperparameters for a given compute budget is critical
Based on these findings, Chinchilla shrinks the model by 4× compared to Gopher and increases the training tokens by 4×!
We revisit the question: given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?
They run experiments on more than 400 models across a range of parameter counts and data sizes and, under the constraint FLOPs(N, D) = C, estimate the model parameter count N and number of training tokens D that minimize the loss L(N, D)
we predict that for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being trained on 4 times more tokens
We verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens
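As a quick sanity check on that trade-off, the standard approximation C ≈ 6·N·D for training FLOPs (an assumption I am bringing in, not something stated in these notes, as is Gopher's roughly 300B training tokens) suggests the two models sit on roughly the same compute budget:

```python
# Back-of-envelope check, assuming training compute C ≈ 6 * N * D and that
# Gopher was trained on ~300B tokens (both assumptions, not from these notes).
def approx_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gopher = approx_flops(280e9, 300e9)       # ~5.0e23 FLOPs
chinchilla = approx_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs

print(f"Gopher     ~{gopher:.2e} FLOPs")
print(f"Chinchilla ~{chinchilla:.2e} FLOPs")
# Same order of compute, spent on a 4x smaller model and ~4x more tokens.
```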
(See Figure 1 and Figure A3 in the paper.)
Estimating the optimal parameter/training tokens allocation
Research Question) Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?
Honestly, this is the question I am most curious about.
Model parameters and training tokens should be scaled up in equal proportion.
Parameter count and number of training tokens should be increased equally with more compute, with the proportions reported in Table 2
3.1. Approach 1: Fix model sizes and vary number of training tokens
Did they use FLOPs because wall-clock time is awkward to use as a budget? I still wonder why FLOPs in particular had to be the measure.
Model sizes are fixed within a range (from 70M to over 10B parameters), and FLOPs(N, D) = C is measured while varying the number of training tokens
At 1500 logarithmically spaced FLOP values, we find which model size achieves the lowest loss of all models along with the required number of training tokens
fit power laws to estimate the optimal model size and number of training tokens for any given amount of compute
(see the center and right panels of Figure 2)
obtaining a relationship N_opt ∝ C^a and D_opt ∝ C^b
We find that a = 0.50 and b = 0.50, as summarized in Table 2.
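A minimal sketch of how such a power-law fit can be done: take the (C, N_opt) pairs recovered from the compute-efficient frontier and regress log N_opt on log C, reading the slope off as the exponent a. The data points below are made-up placeholders, not values from the paper.

```python
import numpy as np

# Minimal sketch of fitting N_opt ∝ C^a by linear regression in log-log space.
# The (C, N_opt) pairs here are placeholders, not numbers from the paper.
C     = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOP budgets
N_opt = np.array([2e8, 6e8, 2e9, 6e9, 2e10])       # loss-minimising model sizes

a, log_k = np.polyfit(np.log(C), np.log(N_opt), deg=1)
print(f"fitted exponent a ≈ {a:.2f}")               # the paper reports a ≈ 0.50
# The same procedure on (C, D_opt) pairs gives the token exponent b.
```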
3.2. Approach 2: IsoFLOP profiles
We vary the model size for a fixed set of 9 different training FLOP counts (ranging from 6 × 10^18 to 3 × 10^21 FLOPs), and consider the final training loss for each point
This is in contrast with Approach 1, which considered points (N, D, L) along the entire training runs. This allows us to directly answer the question: for a given FLOP budget, what is the optimal parameter count?
The more tokens, the lower the loss (even at the same FLOPs!)
We fit a parabola to each IsoFLOPs curve to directly estimate at what model size the minimum loss is achieved (Figure 3, left)
We then fit a power law between FLOPs and the loss-optimal model size and number of training tokens, shown in Figure 3 (center, right). Again, we fit exponents of the form N_opt ∝ C^a and D_opt ∝ C^b, and find a = 0.49 and b = 0.51, as summarized in Table 2.
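A small sketch of the parabola step described above, using placeholder points for one IsoFLOP profile (not the paper's measurements): fit a quadratic in log N and take its vertex as the loss-optimal model size.

```python
import numpy as np

# Sketch of one IsoFLOP profile: final loss vs. model size at a fixed FLOP budget.
# Placeholder data; a parabola in log(N) is fit and its vertex gives the optimum.
N    = np.array([4e8, 1e9, 2.5e9, 6e9, 1.5e10])    # model sizes tried
loss = np.array([2.45, 2.38, 2.35, 2.37, 2.43])    # final training losses

c2, c1, c0 = np.polyfit(np.log(N), loss, deg=2)
log_n_opt = -c1 / (2 * c2)                          # vertex of the parabola
print(f"estimated loss-optimal model size ≈ {np.exp(log_n_opt):.2e} parameters")
# Repeating this for each of the 9 FLOP budgets gives the (C, N_opt) points
# used for the power-law fit in Figure 3 (center, right).
```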
3.3. Approach 3: Fitting a parametric loss function
We model all final losses from the experiments in Approaches 1 and 2 as a parametric function of the model parameter count and the number of seen tokens
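The parametric form used in the paper is L(N, D) = E + A/N^α + B/D^β, fit by minimising a Huber loss between predicted and observed log loss with L-BFGS. Below is a sketch of that procedure on synthetic placeholder data; the run data, starting point, and recovered exponents are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of Approach 3: fit L(N, D) = E + A / N**alpha + B / D**beta to observed
# final losses. The data below is a small synthetic placeholder, not the paper's
# ~400 runs; as in the paper, the fit is done in log space with a Huber loss.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7.5, 10.5, size=200)          # parameter counts
D = 10 ** rng.uniform(9.0, 11.5, size=200)          # training tokens
L = 1.7 + 400 / N**0.34 + 4e3 / D**0.28 + rng.normal(0, 0.01, size=200)

def predicted_log_loss(params, N, D):
    # log(E + A/N^alpha + B/D^beta), parameterised by log_E, log_A, log_B.
    log_E, log_A, log_B, alpha, beta = params
    return np.logaddexp(np.logaddexp(log_E,
                                     log_A - alpha * np.log(N)),
                        log_B - beta * np.log(D))

def objective(params):
    resid = predicted_log_loss(params, N, D) - np.log(L)
    delta = 1e-3                                     # Huber threshold (paper uses 1e-3)
    huber = np.where(np.abs(resid) <= delta,
                     0.5 * resid**2,
                     delta * (np.abs(resid) - 0.5 * delta))
    return huber.sum()

init = np.array([0.0, 5.0, 7.0, 0.5, 0.5])          # rough illustrative starting point
fit = minimize(objective, init, method="L-BFGS-B")
log_E, log_A, log_B, alpha, beta = fit.x
print(f"alpha ≈ {alpha:.2f}, beta ≈ {beta:.2f}")
```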
3.4. Optimal model scaling
Unlike the earlier work (Kaplan et al., 2020), parameters and data show almost equal scaling.
Honestly, the table below was the most important part of this paper for me.
4. Chinchilla
4.1. Model and training details
We train Chinchilla on MassiveText (the same dataset as Gopher) but use a slightly different subset distribution (shown in Table A1) to account for the increased number of training tokens
We use AdamW (Loshchilov and Hutter, 2019) for Chinchilla
We train Chinchilla with a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation (why didn't they apply it?)
We find that this particularly helps with the representation of mathematics and chemistry (might this help on benchmarks like MMLU?!)
The vocabulary is very similar: 94.15% of tokens are the same as those used for training Gopher
The forward and backward passes are computed in bfloat16, but we store a float32 copy of the weights
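A minimal PyTorch sketch of that mixed-precision setup, purely as an illustration of the idea (Chinchilla's actual training code is JAX-based and not shown here): the weights and optimizer state stay in float32, while the forward and backward passes run in bfloat16 under autocast.

```python
import torch

# Illustrative sketch only (not the paper's training code): float32 master
# weights with bfloat16 forward/backward compute.
model = torch.nn.Linear(512, 512)                         # weights stay in float32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)      # AdamW, as used for Chinchilla

x = torch.randn(8, 512)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()                         # forward pass runs in bfloat16
loss.backward()                                           # backward through the bf16 graph
opt.step()                                                # update applied to float32 weights
opt.zero_grad()
```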
4.2. Results
4.2.1. Language modeling
What is bits-per-byte (bpb)? It is the language-modelling loss expressed in bits per byte of raw text rather than per token, so models with different tokenizers can be compared directly.
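A tiny sketch of the conversion (function name and numbers are my own illustration): total nats over a document, converted to bits, divided by the document's byte count.

```python
import math

# bits-per-byte: convert a per-token cross-entropy (in nats) into bits per byte
# of the underlying raw text, so different tokenizers can be compared fairly.
def bits_per_byte(loss_nats_per_token, num_tokens, num_bytes):
    total_bits = loss_nats_per_token * num_tokens / math.log(2)
    return total_bits / num_bytes

# Hypothetical example: average loss of 2.0 nats/token on a document that
# tokenizes 1,000 bytes of text into 250 tokens.
print(f"{bits_per_byte(2.0, num_tokens=250, num_bytes=1000):.3f} bpb")
```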
4.2.2. MMLU
(Remaining results omitted.)
Discussion & Conclusion
The trend so far in large language model training has been to increase the model size, often without increasing the number of training tokens
The trend was to scale only the model size without scaling the tokens, but that turns out to be wrong.
We propose three predictive approaches towards optimally setting model size and training duration, based on the outcome of over 400 training runs
They ran a lot of experiments.
Though there has been significant recent work allowing larger and larger models to be trained, our analysis suggests an increased focus on dataset scaling is needed
Dataset scaling is needed too! (Of course, data quality has to keep up.)
Larger datasets will require extra care to ensure train-test set overlap is properly accounted for, both in the language modelling loss but also with downstream tasks
To do well on both the LM loss and downstream tasks, we need to pay attention to the amount of data.
Chinchilla does suffer from bias and toxicity but interestingly it seems less affected than Gopher
Chinchilla also has bias and toxicity issues, but less so than Gopher.
Appendix
Training set
D.3. Predicted compute optimal frontier for all three methods