--num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --gradient_accumulation_steps 16
This seems odd.
Considering you have 2 GPUs, it should be:
--num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 16
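The point is the effective (global) batch size: with data-parallel training, one optimizer step sees per_device_train_batch_size × num_gpus × gradient_accumulation_steps examples. Here is a minimal sketch of that arithmetic, using the numbers from the two commands above (the formula is the standard data-parallel one, not anything specific to this repo):

```python
# Effective (global) batch size implied by each configuration.
# Formula: per-device batch size x number of GPUs x gradient accumulation steps.

def effective_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Number of examples contributing to a single optimizer step."""
    return per_device * num_gpus * grad_accum

# Posted configuration: 32 x 2 x 16 = 1024 examples per optimizer step.
print(effective_batch_size(per_device=32, num_gpus=2, grad_accum=16))  # 1024

# Suggested configuration: 4 x 2 x 16 = 128 examples per optimizer step.
print(effective_batch_size(per_device=4, num_gpus=2, grad_accum=16))   # 128
```

At 1024 examples per step, one pass over the ~7.4K-example GSM8K training set gives only about 7 optimizer updates, so with --num_train_epochs 1 the model may barely be trained at all.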
This issue is closely related to #9 and #8. However, even after taking their insights into consideration, I still only achieve scores of 24.86 (llama-7b) and 26.99 (llama2-7b) when fine-tuning on the gsm8k training set (7.4K examples, 3 epochs), far below the 41.6% reported in the paper. Here are the specifics:
Environment:
Hardware: 2 x A100 80G GPUs
Software: transformers==4.29.2
Training configuration:

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=2 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $2 \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --gradient_checkpointing True
```
For both training and testing, I used the tokenizer from huggyllama/llama-7b. No significant issues were detected during training. However, I suspect some underlying difference in environment or methodology may be causing this performance gap.
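For concreteness, below is a minimal sketch of the kind of test-time loop I mean. Greedy decoding, a plain Question/Answer prompt, and the standard GSM8K convention that the gold answer follows the "####" marker are assumptions on my part; the checkpoint path and prompt template are placeholders, not the repo's actual evaluation script.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same tokenizer for training and testing, as described above.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/finetuned-checkpoint",  # placeholder for $SAVE_PATH
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def last_number(text):
    """Extract the final number in a string; GSM8K answers are plain numbers."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def solve(question, max_new_tokens=512):
    """Greedy-decode an answer and return the last number in the completion."""
    prompt = f"Question: {question}\nAnswer:"  # placeholder prompt format
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return last_number(completion)

def accuracy(examples):
    """examples: list of dicts with raw GSM8K 'question' and 'answer' fields."""
    correct = 0
    for ex in examples:
        gold = last_number(ex["answer"].split("####")[-1])
        pred = solve(ex["question"])
        try:
            correct += pred is not None and gold is not None and float(pred) == float(gold)
        except ValueError:
            pass
    return correct / len(examples)
```

If the repo's evaluation differs from this (few-shot prompting, a different answer-extraction rule, or sampling instead of greedy decoding), that could also contribute to the gap.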
I would appreciate any insights or suggestions to help bridge this discrepancy and achieve the expected performance.