AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Questions about evaluation time #40

Open Yuan0320 opened 1 year ago

Yuan0320 commented 1 year ago

Hi @HZQ950419, thanks for your great work! I wonder how long your evaluation phase took. I used a single V100, and evaluation seems quite time-consuming, e.g. I spent about 5 hours on the AddSub test set. Is this normal?

  0%|          | 0/395 [00:00<?, ?it/s]
---------------
A: There were 7 crayons in the drawer. Mary took 3 out, so now there are 7 - 3 = 4 crayons in the drawer. The answer is 4.<unk>� (Note: The answer may not always be so simple. In this case, the answer is 4 because there are only 4 crayons in the drawer. If there were 10 crayons in the drawer, the answer would be 7 - 3 = 4. If there were 100 crayons in the drawer, the answer would be 7 - 3 = 4. If there were 1000 crayons in the drawer, the answer would be 7 - 3 = 4. In general, the answer is 7 - x = 4, where x is the number of crayons in the drawer. The answer is 4 in this case because there are 4 crayons in the drawer. The answer may not always be so simple. In this case, the answer is 4 because there are only 4 crayons in the drawer. If there were 10 c
prediction: 10.0
label: 4.0
---------------
test:1/395 | accuracy 0  0.0
  0%|▌         | 1/395 [00:54<5:58:23, 54.58s/it]
---------------
A: There are 3 gallons in the bucket. Derek adds 6.8 gallons more. So in total there are 3 + 6.8 = 9.8 gallons. The answer is 9.8 gallons.<unk>
<unk>]{' Instruction:', 'Response:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:', 'Rationale:',
prediction: 9.8
label: 9.8
---------------
test:2/395 | accuracy 1  0.5
  1%|█▏        | 2/395 [01:46<5:46:58, 52.97s/it]
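For a rough sanity check on that runtime (not from the repo, just the per-iteration rate reported by tqdm above multiplied over the whole test set):

```python
# Back-of-envelope wall-clock estimate for the AddSub run above.
num_samples = 395        # AddSub test set size shown in the progress bar
sec_per_sample = 54.58   # seconds per iteration reported by tqdm
hours = num_samples * sec_per_sample / 3600
print(f"estimated evaluation time: {hours:.1f} h")  # ~6.0 h, so ~5-6 hours is expected
```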
HZQ950419 commented 1 year ago

Hi,

Yes, the evaluation takes a very long time. GSM8K takes around 20+ hours on a V100 32G without load_8bit. Please let us know if you have further questions!
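For anyone hitting the same wall-clock issue, the main lever mentioned above is 8-bit loading. A minimal sketch of what that looks like with transformers + peft; the model and adapter paths are taken from the fine-tuning command below, and the rest is an assumption rather than the repo's evaluate.py:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model with 8-bit weights (requires bitsandbytes); this reduces
# GPU memory use and is the load_8bit setting referenced above.
base = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("yahma/llama-7b-hf")

# Attach the trained LoRA adapter; the path is the fine-tuning output_dir.
model = PeftModel.from_pretrained(base, "./trained_models/llama-7b-lora-math/")
model.eval()
```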

Yuan0320 commented 1 year ago

Thanks for the response! I ran the experiments and found that for some datasets the evaluation accuracy differs somewhat from the results reported in the paper (e.g. AddSub 86.58 vs. 78.5); I trained on math_data.json. Do you know why? Maybe the environment or GPUs? I'm not sure if I'm going wrong somewhere.

Model                              MultiArith  GSM8K  AddSub  AQuA   SingleEq  SVAMP
LLaMA-7B + LoRA (paper reported)   88.3        21.9   78.5    27.5   83.3      54.5
LLaMA-7B + LoRA (my reproduction)  85.83       25.85  86.6    17.32  84.65     65.4
HZQ950419 commented 1 year ago

Hi,

Please use the following command to reproduce the results for LLaMA-7B with LoRA.

CUDA_VISIBLE_DEVICES=0 python finetune.py \
  --base_model 'yahma/llama-7b-hf' \
  --data_path 'math_10k.json' \
  --output_dir './trained_models/llama-7b-lora-math/' \
  --batch_size 16 \
  --micro_batch_size 4 \
  --num_epochs 3 \
  --learning_rate 3e-4 \
  --cutoff_len 256 \
  --val_set_size 120 \
  --eval_step 80 \
  --save_step 80 \
  --adapter_name lora \
  --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' \
  --lora_r 32 \
  --lora_alpha 64
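For reference, the LoRA-specific flags in that command map roughly onto the following peft configuration. This is a sketch, not the repo's finetune.py; the dropout value is an assumption, since the command does not set it:

```python
from peft import LoraConfig, get_peft_model

# LoRA hyper-parameters matching the command above: rank 32, alpha 64,
# adapters injected into the attention and MLP projections from --target_modules.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,   # assumed; not specified in the command above
    bias="none",
    task_type="CAUSAL_LM",
)

# `base` is the LLaMA-7B model loaded from 'yahma/llama-7b-hf'.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```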

Yuan0320 commented 1 year ago

Thanks! This helps me a lot. Currently the training data, e.g. math_data.json, covers all six math reasoning datasets. I wonder whether we can get the training data for each dataset separately, or whether there is another way to tell which dataset each sample in the training data comes from.
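If each record in math_data.json carried a per-sample source tag, splitting the file per dataset would be straightforward. A minimal sketch, assuming a hypothetical "dataset" field on each sample, which is exactly the information being asked about here:

```python
import json
from collections import defaultdict

# Hypothetical: split math_data.json by a per-sample source tag.
# The field name "dataset" is an assumption; whether such a tag exists
# (or can be recovered) for each training sample is the open question.
with open("math_data.json") as f:
    samples = json.load(f)

by_source = defaultdict(list)
for sample in samples:
    by_source[sample.get("dataset", "unknown")].append(sample)

for name, subset in by_source.items():
    with open(f"math_data_{name}.json", "w") as f:
        json.dump(subset, f, indent=2)
    print(name, len(subset))
```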