pad9153 opened 5 months ago
Yes, the above results were from 3 runs of the same YAML file (i.e., same model config, dataset, training params, random seed, etc.), with only the experiment_id changed. The general settings are:
```yaml
tokenizer: /cos_ablation/tokenizers/bigcode_starcoder
max_seq_len: 8192
vocab_size: 49152
seed: 42
save_steps: 5000
max_steps: 35000
do_lmeval: True
learning_rate: 6e-4
max_batch_len: 2
num_nodes: 8
use_profiler: "False"
eos_token: "0"
bos_token: "None"
logical_shards: 640
```
We observed noticeable variability when re-running the FSDP model training script for a small 1.xB llama2 model with fixed seed(s) and the same tokens. Below is a snapshot of the evaluation results for three models created from the same inputs (tokens, training script, seed(s)). Would you please help us investigate the root cause of this variability (data loader, hardware non-determinism, or other variables)? Thanks in advance!
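For what it's worth, one way we could try to narrow this down is to force fully deterministic execution and check whether the runs still diverge. Below is a minimal sketch (not our training script; it assumes plain PyTorch under the FSDP wrapper and reuses the `seed: 42` value from the config above). If runs still differ with these settings, the variability is more likely coming from the data loader sharding/ordering or from collective-communication/hardware effects rather than from non-deterministic kernels.

```python
# Sketch: pin down non-deterministic CUDA kernels as a possible source of
# run-to-run variability. The env var must be set before CUDA is initialized.
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS

import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed all RNGs the training loop might touch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)

# Raise an error on any op that has no deterministic implementation,
# and disable cuDNN autotuning, which can pick different kernels per run.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
```

With these settings enabled, two runs on the same hardware and data order should produce bit-identical losses; any remaining divergence would point at the data pipeline or the distributed setup.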