Open ChaoGaoUCR opened 1 year ago
Hi,
If you want to reproduce all the results in the table, you can just train and evaluate with the given commands. For example, to train LLaMA-7B with LoRA, you can use:

```bash
CUDA_VISIBLE_DEVICES=0 python finetune.py \
  --base_model 'yahma/llama-7b-hf' \
  --data_path 'math_10k.json' \
  --output_dir './trained_models/llama-7b-lora-math/' \
  --batch_size 16 \
  --micro_batch_size 4 \
  --num_epochs 3 \
  --learning_rate 3e-4 \
  --cutoff_len 256 \
  --val_set_size 120 \
  --eval_step 80 \
  --save_step 80 \
  --adapter_name lora
```
For evaluation on SVAMP, as an example:
```bash
CUDA_VISIBLE_DEVICES=0 python evaluate.py \
  --model LLaMA-7B \
  --adapter LoRA \
  --dataset SVAMP \
  --base_model 'yahma/llama-7b-hf' \
  --lora_weights "./trained_models/llama-7b-lora-math"
```
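Before running the full evaluation, it can help to confirm that the trained adapter loads cleanly on top of the base model and actually generates text. This is just a minimal sanity-check sketch using the standard transformers/peft APIs, not part of the repo's scripts; the paths match the commands above and the prompt is only illustrative:

```python
# Sanity check: load the base model, attach the trained LoRA adapter, and generate once.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "yahma/llama-7b-hf"
lora_weights = "./trained_models/llama-7b-lora-math"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, lora_weights)  # fails loudly if adapter files are missing
model.eval()

prompt = "### Instruction:\nWhat is 1 + 1?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```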
If you have any questions, please let us know!
Hi @HZQ950419
Thanks for your great work. I also have a problem when I try to evaluate the fine-tuned model with LoRA. I find the main reason is that the generated response is empty. For example:
The output on BoolQ is:
```
outputs: ['Below is an instruction that describes a task. Write a response that appropriately completes the request. \n\n ### Instruction:\n Please answer the following question with true or false, question: do runs have to be the same suit in gin rummy?\n\nAnswer format: true/false\n\n ### Response:\n ']
output:
Please answer the following question with true or false, question: do runs have to be the same suit in gin rummy?
Answer format: true/false
prediction:
label: true
---------------
test:2637/3270 | accuracy 0 0.0
---------------
```
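The log above shows the model echoing the prompt and producing nothing after the response marker, so the extracted prediction is an empty string and never matches the label. Below is a minimal sketch of that failure mode, assuming the evaluation script splits the decoded text on the "### Response:" marker (illustrative only, not the repo's exact code):

```python
# Illustrative only: why an empty generation leads to an empty prediction and accuracy 0.
decoded = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nPlease answer the following question with true or false, "
    "question: do runs have to be the same suit in gin rummy?\n\n"
    "Answer format: true/false\n\n### Response:\n"
)  # the model stopped here; nothing follows the marker

# Typical post-processing: keep only the text after "### Response:".
prediction = decoded.split("### Response:")[-1].strip()
print(repr(prediction))  # '' -> never matches the label "true", so accuracy stays at 0
```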
Same problem.
Dear Authors,
Thanks for these great projects and your kind help. I tried to reproduce all the results in the table, but I came across several issues. Could you please explain some possible problems?

1. When I tried to tune the model, I found that the function "generate_prompt" in both finetune.py and evaluate.py can't extract data from a JSON file whose fields are not named "input, instruction, output, answer", so I renamed all the fields in my JSON file (see the renaming sketch after this list). I was wondering whether I am doing the right thing. Here are two examples I used. Original one in the GitHub repo: The one I modified:

2. I can't get an answer that is even close to the right label. Since I wasn't working in the ML area before, all the metrics are new to me, but the tuned model gives some results that sound ridiculous to me. I wondered if I did something wrong, or is there another metric I should use to reproduce the tuned-model scores in the GitHub repo? I attached some results I got from my LoRA-tuned model.

BTW, when I switch the test dataset to the train dataset, the accuracy gets higher, but still not as high as the table lists. I wondered if you could share your tuning settings if possible.