HongdongZheng opened 7 months ago
Hi, I'm having the same issue with the trained ToolLLaMA-7b-LoRA. We followed the documented scripts: first we ran inference on the test queries with ToolBench/ToolLLaMA-7b-LoRA from Hugging Face, then we ran the pass rate script:

```bash
export CONVERTED_ANSWER_PATH=../../data/model_predictions_converted
export SAVE_PATH=pass_rate_results
export CANDIDATE_MODEL=ToolBench/ToolLLaMA-7b-LoRA
export API_POOL_FILE=api_key.json

python eval_pass_rate.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --save_path ${SAVE_PATH} \
    --reference_model ${CANDIDATE_MODEL} \
    --test_ids ../../data/test_query_ids/ \
    --max_eval_threads 10 \
    --evaluate_times 1
```
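For context, the numbers we are comparing are per-test-set pass rates in [0, 1]. A minimal sketch of how such a rate is aggregated over one test set — this is only an illustration of the metric, not ToolEval's actual code, and the `results` mapping (query id to verdict string) is an assumption:

```python
# Hypothetical sketch of a pass rate metric: fraction of test queries
# whose final answer was judged "passed". Not ToolEval's implementation.
def pass_rate(results):
    """results: dict mapping query id -> "passed" or "failed"."""
    if not results:
        return 0.0
    return sum(1 for verdict in results.values() if verdict == "passed") / len(results)

# 2 of 3 queries passed -> roughly 0.667
print(pass_rate({"q1": "passed", "q2": "failed", "q3": "passed"}))
```

A rate of 0.3-0.4 under this metric would mean roughly a third of the queries in each set were judged solved.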
and we get pass rates between 0.3 and 0.4 on all six test sets.

What are we missing? Can you help, please?
following
We tried to reproduce the official results following the guidance on the ToolEval page. We downloaded reproduction_data.zip from Google Drive, unzipped it, put reproduction_data under ToolBench/data/, and skipped the data preparation step. We then ran the following script to obtain the pass rate of the chatgpt_cot model:
```bash
export CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/
export SAVE_PATH=/tmp/data/pass_rate_results
export CANDIDATE_MODEL=chatgpt_cot
export API_POOL_FILE=../../data/openai_key_json_file.json

mkdir $SAVE_PATH
python3 eval_pass_rate.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --save_path ${SAVE_PATH} \
    --reference_model ${CANDIDATE_MODEL} \
    --test_ids ../../data/test_query_ids/ \
    --max_eval_threads 8 \
    --evaluate_times 4
```
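Our understanding (an assumption, not confirmed from the ToolEval source) is that with `--evaluate_times 4` each answer is judged several times and the per-run pass rates are averaged, so a single unlucky judgment run should not explain a large gap:

```python
# Hypothetical sketch: averaging per-run pass rates across the
# --evaluate_times repetitions. Not ToolEval's actual code.
from statistics import mean

def averaged_pass_rate(per_run_rates):
    """per_run_rates: one pass rate per evaluation repetition."""
    return mean(per_run_rates)

# Average of four per-run rates, e.g. from --evaluate_times 4
print(averaged_pass_rate([0.35, 0.40, 0.30, 0.35]))
```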
The resulting pass rates are as follows: ![image](https://github.com/OpenBMB/ToolBench/assets/24677169/43f874ad-dc44-4057-aee8-a7b5a60d089b)
These results are much lower than those published in the code repository and the paper. Where did we go wrong in our reproduction steps, or what did we miss? Could you give us some advice?