artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License
9.96k stars · 820 forks

Could not reproduce the results listed in your paper using a single 3090 card. #264

Open LiZhangMing opened 1 year ago

LiZhangMing commented 1 year ago

Details: Here is the result from your paper:

[image: results table from the paper]

I used the following command to reproduce the results of the LLaMA 7B model on the Guanaco (OASST1) dataset: CUDA_VISIBLE_DEVICES=2 sh scripts/finetune_guanaco_7b.sh, and the best result I obtained is:

[image: best reproduced result]
Joshuaclymer commented 1 year ago

A 1% difference is not a big deal, comrade. That could be noise.
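Whether a 1% gap is plausibly noise can be gauged from the binomial standard error of an accuracy estimate. A minimal sketch, assuming roughly 1.5k questions for the MMLU validation/dev split and roughly 14k for the test split (the exact counts depend on the benchmark version):

```python
import math

def accuracy_stderr(acc, n):
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1 - acc) / n)

# With ~1,500 questions at ~35% accuracy, one standard error is roughly
# 1.2 points, so a 1-point gap between two runs is well within noise.
se_dev = accuracy_stderr(0.35, 1500)

# With ~14,000 questions, one standard error shrinks to roughly 0.4 points,
# so the same 1-point gap starts to look meaningful.
se_test = accuracy_stderr(0.35, 14000)

print(100 * se_dev, 100 * se_test)
```

This also ignores run-to-run variance from seeds and data ordering, which is typically larger than the sampling error alone.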

Forence1999 commented 11 months ago

For LLaMA 7B, I could only reproduce the Alpaca result (paper: 38.8); the others came out lower: 32.7 for chip2 (paper: 34.5), 30.9 for longform (paper: 32.1), and 33.7 for self-instruct (paper: 36.4).

Does anyone have ideas about this? Thanks!

StiphyJay commented 11 months ago

> For LLaMA 7B, I could only reproduce the Alpaca result (paper: 38.8); the others came out lower: 32.7 for chip2 (paper: 34.5), 30.9 for longform (paper: 32.1), and 33.7 for self-instruct (paper: 36.4).
>
> Does anyone have ideas about this? Thanks!

For the Alpaca dataset, how did you set the hyperparameters?

Forence1999 commented 11 months ago

> For LLaMA 7B, I could only reproduce the Alpaca result (paper: 38.8); the others came out lower: 32.7 for chip2 (paper: 34.5), 30.9 for longform (paper: 32.1), and 33.7 for self-instruct (paper: 36.4). Does anyone have ideas about this? Thanks!
>
> For the Alpaca dataset, how did you set the hyperparameters?

Simply follow the bash files in ./scripts/.

Actually, according to another issue in this project, the paper evaluates on the MMLU test set, while qlora.py reports performance on the MMLU dev set. You therefore need to modify the script to add evaluation on the test set. With that change, the results on alpaca and longform should be reproducible, which I've confirmed.
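For context, the QLoRA paper reports MMLU as an average over the benchmark's 57 subjects. A minimal sketch of that aggregation, with hypothetical variable names (this is not the repo's actual code; note that the command later in this thread already passes --mmlu_split test):

```python
from collections import defaultdict

def macro_mmlu_accuracy(records):
    """records: iterable of (subject, predicted_letter, gold_letter) tuples.
    Returns the unweighted mean of per-subject accuracies, one common way
    MMLU scores are aggregated."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, pred, gold in records:
        total[subject] += 1
        correct[subject] += int(pred == gold)
    per_subject = [correct[s] / total[s] for s in total]
    return sum(per_subject) / len(per_subject)

# Toy example: two subjects at 50% and 100% accuracy -> macro average 0.75.
records = [
    ("abstract_algebra", "A", "A"),
    ("abstract_algebra", "B", "C"),
    ("anatomy", "D", "D"),
    ("anatomy", "A", "A"),
]
print(macro_mmlu_accuracy(records))
```

Because the dev/val split is much smaller than the test split, scores on the two splits can easily differ by a point or two, which is consistent with the discrepancies reported above.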

Edenzzzz commented 4 months ago

@Forence1999 could you share how you reproduced it? I only got 32.1 with the original hyperparameters. Thanks!

python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --use_auth \
    --output_dir /fly/results/qlora \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 1 \
    --max_new_tokens 32 \
    --dataloader_num_workers 1 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 16 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset alpaca \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 1875 \
    --eval_steps 187 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0 \
    --mmlu_split test
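One thing worth double-checking in the command above: if I recall the qlora.py argument handling correctly, --quant_type nf4 and --double_quant only take effect when the model is loaded in 4-bit, so --bits 16 would disable quantization entirely (the paper's QLoRA runs use --bits 4). Separately, the effective batch size implied by the flags can be sanity-checked with simple arithmetic:

```python
# Values taken directly from the command-line flags above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 1875

# Optimizer-step batch size: per-device batch times accumulation steps.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Total training examples processed over the run.
examples_seen = effective_batch * max_steps

print(effective_batch, examples_seen)
```

If these don't match the paper's reported batch size and training budget, that alone could explain a score gap of this magnitude.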
Forence1999 commented 4 months ago

> @Forence1999 could you share how you reproduced it? I only got 32.1 with the original hyperparameters. Thanks!

> (quoted script omitted; identical to the command above)

Hi @Edenzzzz, sorry, but I can't share my scripts anymore; it's been a long time since I used them. It looks like quite a few params have been modified in your script.

Suggestions:

  1. Use the scripts provided by the author, with as few modifications as possible.
  2. Build a Docker image from the Dockerfile provided by the author. Ensuring a consistent environment is critically important for reproducing the exact results, and it will also save you a lot of time. If you don't want to build from scratch, you can simply pull my environment (docker pull forence/open-instruct:v1). Please note that I built this image with the author's Dockerfile as a reference, but did not follow it exactly.

Hope this could help you a bit!