aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

Example 10.FSDP reports 35b model created instead of 70b #296

Open verdimrc opened 4 months ago

verdimrc commented 4 months ago

The README recommends these hyperparameters to train a 70B model:

--num_key_value_heads=8
--llama_intermediate_size=28672
--hidden_width=8192
--num_layers=80
--num_heads=64

but the train script reports that it creates a 35B model instead:

0: 2024-04-16 11:50:01 I [train.py:155] Creating Model
0: 2024-04-16 11:58:16 I [train.py:162] Created model with total parameters: 34549800960 (34.55 B)
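
For reference, a back-of-the-envelope parameter count for a Llama-2-style model with the hyperparameters above comes out to roughly 69 B, i.e. about double what the script reports. The sketch below is mine, not the repo's code; it assumes a 32,000-token vocabulary (the hf-internal-testing/llama-tokenizer size) and an untied LM head, which may not match exactly how train.py builds the model:

# Rough parameter count for a Llama-2-style model with the README's
# hyperparameters. vocab_size=32000 and an untied LM head are assumptions.
vocab_size = 32_000            # assumed tokenizer vocabulary
hidden     = 8192              # --hidden_width
layers     = 80                # --num_layers
heads      = 64                # --num_heads
kv_heads   = 8                 # --num_key_value_heads (GQA)
inter      = 28_672            # --llama_intermediate_size

head_dim = hidden // heads         # 128
kv_dim   = kv_heads * head_dim     # 1024

attn  = 2 * hidden * hidden + 2 * hidden * kv_dim  # q/o + k/v projections
mlp   = 3 * hidden * inter                         # gate, up, down projections
norms = 2 * hidden                                 # two RMSNorms per block

per_layer = attn + mlp + norms
total = (layers * per_layer          # transformer blocks
         + vocab_size * hidden       # token embedding
         + hidden                    # final RMSNorm
         + vocab_size * hidden)      # untied LM head

print(f"expected parameters: {total:,} ({total / 1e9:.2f} B)")
# -> expected parameters: 68,976,648,192 (68.98 B)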

Full command:

srun -l ./pt_fsdp_haha/bin/torchrun \
    --nproc_per_node=8 --nnodes=2 \
    --rdzv_id=324 --rdzv_backend=c10d --rdzv_endpoint=p4de-st-p4de-1 \
    ./train.py \
    --num_key_value_heads=8 \
    --llama_intermediate_size=28672 \
    --hidden_width=8192 \
    --num_layers=80 \
    --num_heads=64 \
    --checkpoint_dir=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/chkpts \
    --max_context_width=4096 \
    --model_type=llama_v2 \
    --tokenizer=hf-internal-testing/llama-tokenizer \
    --checkpoint_freq=1 \
    --validation_freq=500 \
    --max_steps=4 \
    --epochs=1 \
    --dataset=c4 \
    --dataset_config_name=en \
    --train_batch_size=1 \
    --val_batch_size=1 \
    --sharding_strategy=full \
    --offload_activations=1
github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity.