Open ChaoGaoUCR opened 1 year ago
Hi,
If you want to reproduce all the results in the table, you can just train and evaluate with the given commands. For example, to train LLaMA-7B with LoRA, you can use:

```bash
CUDA_VISIBLE_DEVICES=0 python finetune.py \
  --base_model 'yahma/llama-7b-hf' \
  --data_path 'math_10k.json' \
  --output_dir './trained_models/llama-7b-lora-math/' \
  --batch_size 16 \
  --micro_batch_size 4 \
  --num_epochs 3 \
  --learning_rate 3e-4 \
  --cutoff_len 256 \
  --val_set_size 120 \
  --eval_step 80 \
  --save_step 80 \
  --adapter_name lora
```
For evaluation on SVAMP, as an example:
```bash
CUDA_VISIBLE_DEVICES=0 python evaluate.py \
  --model LLaMA-7B \
  --adapter LoRA \
  --dataset SVAMP \
  --base_model 'yahma/llama-7b-hf' \
  --lora_weights "./trained_models/llama-7b-lora-math"
```
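Before running the full evaluation, it can help to confirm that the trained adapter loads cleanly on top of the base model and actually generates text. This is just a minimal sanity-check sketch using the standard transformers/peft APIs, not part of the repo's scripts; the paths match the commands above and the prompt is only illustrative:

```python
# Sanity check: load the base model, attach the trained LoRA adapter, and generate once.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "yahma/llama-7b-hf"
lora_weights = "./trained_models/llama-7b-lora-math"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, lora_weights)  # fails loudly if adapter files are missing
model.eval()

prompt = "### Instruction:\nWhat is 1 + 1?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```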
If you have any questions, please let us know!
Hi @HZQ950419
Thanks for your great work. I also have a problem when I try to evaluate the fine-tuned model with LoRA. I find the main reason is that the generated response is empty. For example:
The output on BoolQ is:
```
outputs: ['Below is an instruction that describes a task. Write a response that appropriately completes the request. \n\n ### Instruction:\n Please answer the following question with true or false, question: do runs have to be the same suit in gin rummy?\n\nAnswer format: true/false\n\n ### Response:\n ']
output:
Please answer the following question with true or false, question: do runs have to be the same suit in gin rummy?
Answer format: true/false
prediction:
label: true
---------------
test:2637/3270 | accuracy 0 0.0
---------------
```
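The log above shows the model echoing the prompt and producing nothing after the response marker, so the extracted prediction is an empty string and never matches the label. Below is a minimal sketch of that failure mode, assuming the evaluation script splits the decoded text on the "### Response:" marker (illustrative only, not the repo's exact code):

```python
# Illustrative only: why an empty generation leads to an empty prediction and accuracy 0.
decoded = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nPlease answer the following question with true or false, "
    "question: do runs have to be the same suit in gin rummy?\n\n"
    "Answer format: true/false\n\n### Response:\n"
)  # the model stopped here; nothing follows the marker

# Typical post-processing: keep only the text after "### Response:".
prediction = decoded.split("### Response:")[-1].strip()
print(repr(prediction))  # '' -> never matches the label "true", so accuracy stays at 0
```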
Same problem.
Dear Authors,
Thanks for these great projects and your kind help. I tried to reproduce all the results in the table, but I came across several issues. Could you please explain some possible problems?

1. When I tried to tune the model, I found that the function "generate_prompt" in both finetune.py and evaluate.py can't extract data from a JSON file whose fields are not named "input, instruction, output, answer", so I renamed all the fields in my JSON file (see the renaming sketch after this list). I was wondering whether I am doing the right thing. Here are two examples I used. Original one in the GitHub repo: The one I modified:

2. I can't get an answer that is even close to the right label. Since I wasn't working in the ML area before, all the metrics are new to me, but the tuned model gives some results that sound ridiculous to me. I wondered if I did something wrong, or is there another metric I should use to reproduce the tuned-model scores in the GitHub repo? I attached some results I got from my LoRA-tuned model.

BTW, when I switch the test dataset to the train dataset, the accuracy gets higher, but still not as high as the table lists. I wondered if you could share your tuning settings if possible.