pad9153 opened 5 months ago
Yes, the above results were from 3 runs of the same YAML file (i.e., same model config, dataset, training params, random seed, etc.), with only the experiment_id changed. The general settings are:
```yaml
tokenizer: /cos_ablation/tokenizers/bigcode_starcoder
max_seq_len: 8192
vocab_size: 49152
seed: 42
save_steps: 5000
max_steps: 35000
do_lmeval: True
learning_rate: 6e-4
max_batch_len: 2
num_nodes: 8
use_profiler: "False"
eos_token: "0"
bos_token: "None"
logical_shards: 640
```
We observed noticeable variability when re-running the FSDP model training script for a small 1.xB llama2 model with fixed seed(s) and the same tokens. Below is a snapshot of the evaluation results for three models created from the same inputs (tokens, training script, seed(s)). Would you please help us investigate the root cause of this variability (data loader, hardware non-determinism, or other variables)? Thanks in advance!
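For what it's worth, one way we could try to narrow this down is to force fully deterministic execution and check whether the runs still diverge. Below is a minimal sketch (not our training script; it assumes plain PyTorch under the FSDP wrapper and reuses the `seed: 42` value from the config above). If runs still differ with these settings, the variability is more likely coming from the data loader sharding/ordering or from collective-communication/hardware effects rather than from non-deterministic kernels.

```python
# Sketch: pin down non-deterministic CUDA kernels as a possible source of
# run-to-run variability. The env var must be set before CUDA is initialized.
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS

import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed all RNGs the training loop might touch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)

# Raise an error on any op that has no deterministic implementation,
# and disable cuDNN autotuning, which can pick different kernels per run.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
```

With these settings enabled, two runs on the same hardware and data order should produce bit-identical losses; any remaining divergence would point at the data pipeline or the distributed setup.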