OpenBMB / ToolBench

[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language model for tool learning.
https://openbmb.github.io/ToolBench/
Apache License 2.0

Pass rate cannot be reproduced #209

Open HongdongZheng opened 7 months ago

HongdongZheng commented 7 months ago

We are trying to reproduce the official results by following the guidance on the ToolEval webpage. We downloaded the reproduction data (reproduction_data.zip) from Google Drive, unzipped it, placed reproduction_data under ToolBench/data/, and skipped the data preparation step. We then ran the following script to obtain the pass rate of the model chatgpt_cot:

    export CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/
    export SAVE_PATH=/tmp/data/pass_rate_results
    export CANDIDATE_MODEL=chatgpt_cot
    export API_POOL_FILE=../../data/openai_key_json_file.json

    mkdir $SAVE_PATH
    python3 eval_pass_rate.py \
        --converted_answer_path ${CONVERTED_ANSWER_PATH} \
        --save_path ${SAVE_PATH} \
        --reference_model ${CANDIDATE_MODEL} \
        --test_ids ../../data/test_query_ids/ \
        --max_eval_threads 8 \
        --evaluate_times 4

The resulting pass rates are as follows: [image]

These results are much lower than the results published in the code repository and the paper. Where did we go wrong in our reproduction steps, or what did we miss? Can you give us some advice?
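One quick check before rerunning the evaluation is whether every test set listed under the test-ID directory has a matching, non-empty prediction file for the candidate model. The sketch below assumes the layout <converted_answer_path>/<model>/<test_set>.json, which is not confirmed in this thread, and check_predictions is a hypothetical helper rather than part of ToolBench.

```shell
#!/usr/bin/env bash
# Sketch only: the directory layout checked here is an ASSUMPTION
# (<answers_dir>/<model>/<test_set>.json), not taken from ToolBench itself.

# check_predictions is a hypothetical helper, not a ToolBench command.
# Args: answers_dir model ids_dir
# Prints one status line per test set; returns 0 only if nothing is missing.
check_predictions() {
  local answers_dir="$1" model="$2" ids_dir="$3"
  local missing=0 ids_file test_set
  for ids_file in "$ids_dir"/*.json; do
    # If the glob matched nothing, the pattern is left as a literal string.
    [ -e "$ids_file" ] || { echo "no test id files in $ids_dir"; return 2; }
    test_set="$(basename "$ids_file" .json)"
    # -s is true only for files that exist and are non-empty.
    if [ -s "$answers_dir/$model/$test_set.json" ]; then
      echo "ok      $test_set"
    else
      echo "MISSING $test_set"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

With the variables from the script above, this could be called as `check_predictions "$CONVERTED_ANSWER_PATH" "$CANDIDATE_MODEL" ../../data/test_query_ids`. A MISSING line would suggest a path or model-name mismatch rather than a genuinely low pass rate.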

shadii4 commented 7 months ago

Hi, I'm having the same issue with the trained ToolLLM-7b-lora model. We followed the scripts as described: first we ran inference on the test queries with ToolBench/ToolLLaMA-7b-LoRA from Hugging Face, then we ran the pass rate script:

    export CONVERTED_ANSWER_PATH=../../data/model_predictions_converted
    export SAVE_PATH=pass_rate_results
    export CANDIDATE_MODEL=ToolBench/ToolLLaMA-7b-LoRA
    export API_POOL_FILE=api_key.json

    python eval_pass_rate.py \
        --converted_answer_path ${CONVERTED_ANSWER_PATH} \
        --save_path ${SAVE_PATH} \
        --reference_model ${CANDIDATE_MODEL} \
        --test_ids ../../data/test_query_ids/ \
        --max_eval_threads 10 \
        --evaluate_times 1

We get pass rates between 0.3 and 0.4 on all six test sets.

What are we missing? Can you help, please?

caixd-220529 commented 2 months ago

following