AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Reproduce the commonsense results on BoolQ #64

Open Zhenyu001225 opened 2 months ago

Zhenyu001225 commented 2 months ago

When I'm doing the evaluation, should I use --load_8bit? I'm trying to reproduce the results of LLaMA-7B-LoRA.

Finetune: CUDA_VISIBLE_DEVICES=8 python finetune.py --base_model 'yahma/llama-7b-hf' --data_path './ft-training_set/commonsense_170k.json' --output_dir './trained_models/llama-7b-lora-commonsense/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64

Evaluate: CUDA_VISIBLE_DEVICES=3 python commonsense_evaluate.py --model LLaMA-7B --adapter LoRA --dataset boolq --batch_size 1 --base_model 'yahma/llama-7b-hf' --lora_weights './trained_models/llama-7b-lora-commonsense/'

But my result is only 57.5, compared with 68.9 in the table. Could you provide me with some insights here?

Zhenyu001225 commented 2 months ago

And for PIQA my result is 74.6 compared with 80.7 in the table, and for SIQA it is 60.8 compared with 77.4. Should I finetune again, or adjust any of the hyperparameters?

lucasliunju commented 2 months ago

Hi, may I ask whether you have solved this issue?

wutaiqiang commented 2 months ago

By the way, I find that a larger batch size leads to some bad outputs, while bsz=1 does not.

lucasliunju commented 2 months ago

@wutaiqiang Yes, I also ran into this problem. bsz=1 handles most cases, but it can still produce bad results in some cases.

wutaiqiang commented 2 months ago

In my case, the results are even better than reported. You should use a single GPU for finetuning.

wutaiqiang commented 2 months ago

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

wutaiqiang commented 2 months ago

For LLaMA-7B + LoRA.

lucasliunju commented 2 months ago

Hi @wutaiqiang, thanks for your data point. I tried changing the base model from "float16" to "float32" or "bfloat16", and I find the output results are not very stable.
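
For reference, a minimal sketch (not taken from this repo's code) of switching the base model's precision in transformers; `yahma/llama-7b-hf` is the base model used in the commands above, and the dtype choices mirror the ones mentioned here:

```python
# Minimal sketch (not the repo's code) of loading the base model in different
# precisions to compare output stability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "yahma/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Swap torch_dtype between float16, bfloat16 and float32 to compare runs.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,  # or torch.float16 / torch.float32
    device_map="auto",
)
```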

Zhenyu001225 commented 2 months ago

> Hi, may I ask whether you have solved this issue?

Hi, I changed the transformers version to 4.35.0 and used batch_size=1 when doing the evaluation.

Now the results are:

Model | GSM8K | SVAMP | AQuA | MultiArith | SingleEq | AddSub
LLaMA-7B-LoRA-math | 37.9 | 47.0 | 19.68 | 97.5 | 85.83 | 83.54

Model | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-c | ARC-e | OpenBookQA | Average
LLaMA-7B-LoRA-Commonsense | 64.01 | 80.25 | 77.28 | 76.50 | 79.79 | 62.54 | 77.31 | 77.4 | 74.39
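
For anyone reproducing this, a quick sanity check of the setup reported above (transformers pinned to 4.35.0, evaluation run with batch_size=1) could look like this hypothetical snippet:

```python
# Hypothetical sanity check for the setup reported above; install the pinned
# version with: pip install transformers==4.35.0
import importlib.metadata

version = importlib.metadata.version("transformers")
print(f"transformers {version}")
assert version.startswith("4.35"), "expected transformers 4.35.x for this run"
```
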
Zhenyu001225 commented 2 months ago

> For LLaMA-7B + LoRA.

Hi, what is the version of transformers in your case?

wutaiqiang commented 2 months ago

4.32.1

Zhenyu001225 commented 2 months ago

> 4.32.1

Thank you so much~ I'll try again

clarenceluo78 commented 2 months ago

> boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
> 69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

Hi there, I want to ask whether you used 8-bit quantization when reproducing?

Zhenyu001225 commented 2 months ago

> boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
> 69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8
>
> Hi there, I want to ask whether you used 8-bit quantization when reproducing?

I didn't enable 8-bit quantization.
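
To make the distinction concrete, here is a hedged sketch of the two loading modes, assuming the repo's `--load_8bit` option corresponds to bitsandbytes 8-bit loading in transformers (illustrative only, not the repo's actual evaluation code):

```python
# Illustrative sketch, assuming --load_8bit maps to bitsandbytes 8-bit loading
# in transformers; not copied from the repo's evaluation code.
from transformers import AutoModelForCausalLM

# Default load (no 8-bit quantization), i.e. what "didn't enable 8-bit" means here:
model_fp = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",
    device_map="auto",
)

# 8-bit quantized load (requires bitsandbytes); scores can shift slightly:
model_int8 = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
```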

wutaiqiang commented 2 months ago

After rerunning, the results are:

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
68.13 | 80.3 | 78.45 | 83.11 | 80.66 | 77.23 | 65.78 | 79.4

AaronZLT commented 3 weeks ago

@wutaiqiang have you tried the math finetuning?

wutaiqiang commented 2 weeks ago

Not yet @AaronZLT

wutaiqiang commented 2 weeks ago

By the way, I find the results quite unstable; try multiple times and you will get quite different results.

AaronZLT commented 2 weeks ago

Hi, @lucasliunju @Zhenyu001225

> @wutaiqiang Yes, I also ran into this problem. bsz=1 handles most cases, but it can still produce bad results in some cases.

Does 'bsz=1' here mean the batch size in finetuning or in evaluation? In general I would expect the evaluation batch size not to affect the eval result; otherwise it would be very strange. Also, the evaluation results from this repo are quite different from lm-eval-harness (the official eval repo used by the Hugging Face Open LLM Leaderboard), where the results are poor.

By the way, after finetuning on the commonsense-170k dataset, the performance on 5-shot MMLU drops and is worse than the base model.
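
For reference, a rough sketch of how such a cross-check with lm-eval-harness might look via its Python API; the `peft=` model argument, task names, and adapter path below are assumptions based on lm-eval v0.4.x and the training command earlier in this thread, so adjust them to your setup:

```python
# Rough sketch of a cross-check with lm-evaluation-harness (v0.4.x API assumed);
# the peft= argument, task names, and adapter path may differ in your setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=yahma/llama-7b-hf,"
        "peft=./trained_models/llama-7b-lora-commonsense/"
    ),
    tasks=["boolq", "piqa", "openbookqa"],
    batch_size=1,
)
print(results["results"])
```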

wutaiqiang commented 2 weeks ago

Yes, bsz means the batch size.

You can refer to: https://openreview.net/pdf?id=9MDjKb9lGi

@AaronZLT

wutaiqiang commented 2 weeks ago

Also this one:

https://www.reddit.com/r/LocalLLaMA/comments/19dn2to/inconsistencies_in_llm_outputs_single_vs_batched/
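
For context, here is a minimal sketch (not from this repo) of the single-sample vs. batched generation comparison those links discuss: batching forces left padding, and padding together with fp16 kernels can change the decoded outputs slightly.

```python
# Minimal sketch (not from this repo) comparing bsz=1 and batched generation;
# left padding plus fp16 kernels can make batched outputs differ from single runs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "yahma/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)

prompts = [
    "Question: Is the sky blue? Answer:",
    "Question: Can fish fly? Answer:",
]

# bsz=1: one prompt at a time, no padding involved.
single = []
for p in prompts:
    inputs = tokenizer(p, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    single.append(tokenizer.decode(out[0], skip_special_tokens=True))

# Batched: prompts are left-padded to a common length.
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**batch, max_new_tokens=8, do_sample=False)
batched = tokenizer.batch_decode(out, skip_special_tokens=True)

print(single)
print(batched)  # may not match `single` exactly
```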