OFA-Sys / gsm8k-ScRel

Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
https://arxiv.org/abs/2308.01825

Reproducing llama7b2-sft problem #12

Closed huijiawu0 closed 1 year ago

huijiawu0 commented 1 year ago

This issue is closely related to #9 and #8. However, even after taking their insights into account, I still only achieve scores of 24.86 (llama-7b) and 26.99 (llama2-7b) when fine-tuning on the GSM8K training set (7.4K examples, 3 epochs), far short of the 41.6% reported in the paper. Here are the specifics:

Environment:
- Hardware: 2 × A100 80G GPUs
- Software: transformers==4.29.2

Training configuration:

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch \
    --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=2 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $2 \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --gradient_checkpointing True
```

For both training and testing, I used the tokenizer from huggyllama/llama-7b. No significant issues appeared during training. However, I suspect some underlying difference in environment or methodology is causing this performance gap.
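For reference, here is a minimal sketch (not the repo's actual code) of how the huggyllama/llama-7b tokenizer is typically loaded in an Alpaca-style SFT setup; the padding choices below are assumptions, not settings confirmed in this thread:

```python
from transformers import AutoTokenizer

# Assumed Alpaca-style setup, not the repo's exact configuration.
tokenizer = AutoTokenizer.from_pretrained(
    "huggyllama/llama-7b",
    use_fast=False,        # LLaMA is usually paired with the slow SentencePiece tokenizer
    padding_side="right",  # assumption: right padding during SFT
)

# LLaMA ships without a pad token, so SFT scripts normally set one before batching.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token
```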

I would appreciate any insights or suggestions to help bridge this discrepancy and achieve the expected performance.

GanjinZero commented 1 year ago

```
--num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --gradient_accumulation_steps 16
```

This seems odd.

GanjinZero commented 1 year ago

Considering you have 2 GPUs, it should be:

```
--num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 16
```
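To make the difference concrete, here is a quick illustrative calculation of the effective global batch size implied by each set of flags; that 128 is the exact value intended for the paper's runs is an assumption, not something stated in this thread:

```python
# Effective global batch size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
num_gpus = 2

issue_config = num_gpus * 32 * 16      # flags in the original report -> 1024 sequences per optimizer step
suggested_config = num_gpus * 4 * 16   # flags suggested above        -> 128 sequences per optimizer step

print(issue_config, suggested_config)  # 1024 128
```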