Yuan0320 opened 1 year ago
Hi,
Yes, the evaluation will take a very long time. GSM8K takes around 20+ hours on a V100 32GB without load_8bit. Please let us know if you have further questions!
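For a rough sense of scale (my arithmetic, not a number from the repo): GSM8K's test split has 1,319 questions, so ~20 hours works out to just under a minute of generation per question:

```python
# Per-question latency implied by the ~20 h GSM8K evaluation figure above.
# GSM8K's test split contains 1,319 questions.
total_hours = 20
num_questions = 1319
seconds_per_question = total_hours * 3600 / num_questions
print(f"{seconds_per_question:.1f} s/question")  # → 54.6 s/question
```

At that rate, a smaller suite like AddSub (395 questions) taking a few hours on the same hardware is in the same ballpark.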
Thanks for the response! I ran the experiments (trained with math_data.json) and found that for some datasets the evaluation accuracy differs noticeably from the results reported in the paper (e.g., AddSub: 86.58 vs. 78.5). Do you know why? Could it be the environment or the GPUs? I'm not sure if I'm going wrong somewhere.
Model | MultiArith | GSM8K | AddSub | AQuA | SingleEq | SVAMP
---|---|---|---|---|---|---
LLaMA-7B + LoRA (paper reported) | 88.3 | 21.9 | 78.5 | 27.5 | 83.3 | 54.5
LLaMA-7B + LoRA (my reproduction) | 85.83 | 25.85 | 86.6 | 17.32 | 84.65 | 65.4
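To make the gaps in the table concrete, the per-dataset deltas (reproduced minus reported) can be computed directly; AddSub, AQuA, and SVAMP show the largest differences:

```python
# Accuracy figures copied from the table above.
reported   = {"MultiArith": 88.3, "GSM8K": 21.9, "AddSub": 78.5,
              "AQuA": 27.5, "SingleEq": 83.3, "SVAMP": 54.5}
reproduced = {"MultiArith": 85.83, "GSM8K": 25.85, "AddSub": 86.6,
              "AQuA": 17.32, "SingleEq": 84.65, "SVAMP": 65.4}

# Positive delta = reproduction beat the paper; negative = fell short.
deltas = {k: round(reproduced[k] - reported[k], 2) for k in reported}
print(deltas)
```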
Hi,
Please use the following command to reproduce the LLaMA-7B-LoRA result:
```bash
CUDA_VISIBLE_DEVICES=0 python finetune.py \
  --base_model 'yahma/llama-7b-hf' \
  --data_path 'math_10k.json' \
  --output_dir './trained_models/llama-7b-lora-math/' \
  --batch_size 16 \
  --micro_batch_size 4 \
  --num_epochs 3 \
  --learning_rate 3e-4 \
  --cutoff_len 256 \
  --val_set_size 120 \
  --eval_step 80 \
  --save_step 80 \
  --adapter_name lora \
  --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' \
  --lora_r 32 \
  --lora_alpha 64
```
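One detail worth noting when reproducing: in alpaca-lora-style training scripts, `--batch_size` and `--micro_batch_size` together determine gradient accumulation (this repo's finetune.py may compute it slightly differently, so treat this as a sketch of the usual convention):

```python
# Common convention in alpaca-lora-style finetune scripts:
# the effective batch size per optimizer step is reached by
# accumulating gradients over several micro-batches.
batch_size = 16        # effective examples per optimizer step
micro_batch_size = 4   # examples per forward/backward pass
gradient_accumulation_steps = batch_size // micro_batch_size
print(gradient_accumulation_steps)  # → 4
```

If your GPU forces a smaller micro_batch_size, keeping batch_size at 16 should preserve the effective batch size.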
Thanks! This helps me a lot. Currently the training dataset (e.g., math_data.json) covers all six math reasoning datasets. Could we get the training data for each dataset separately? Or is there another way to find out which dataset each training sample comes from?
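On recovering per-dataset splits: if each merged record carried a field naming its source (the real math_data.json entries may not include such a field, so this is only a sketch with hypothetical records), splitting is a one-pass grouping; otherwise you would have to match each sample's question text against the original datasets:

```python
import json
from collections import defaultdict

# Hypothetical merged records for illustration; real math_data.json
# entries may lack a "dataset" field entirely.
merged = [
    {"instruction": "3 + 4 = ?", "output": "7", "dataset": "AddSub"},
    {"instruction": "Solve for x: 2x = 4.", "output": "x = 2", "dataset": "SVAMP"},
]

# Group samples by their (assumed) source-dataset tag.
by_dataset = defaultdict(list)
for sample in merged:
    by_dataset[sample["dataset"]].append(sample)

# Write one JSON file per source dataset.
for name, samples in by_dataset.items():
    with open(f"math_{name}.json", "w") as f:
        json.dump(samples, f, indent=2)
```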
Hi @HZQ950419, thanks for your great work! How long did your evaluation phase take? I used a single V100, and evaluation seems quite time-consuming, e.g., about 5 hours on the AddSub test set. Is this normal?