OpenBMB / ToolBench

[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
https://openbmb.github.io/ToolBench/
Apache License 2.0

Fairness and time cost about ToolEval. #106

Closed realgump closed 1 year ago

realgump commented 1 year ago

Hello, and thanks for contributing ToolLlama and ToolEval. However, I have some questions about the ToolEval benchmark.

  1. I noticed that some APIs may be invalid, which has also been mentioned in other issues such as #53. Does that mean that inference on the dataset will be affected by the health of all the APIs? How can we ensure that the evaluation is fair?

  2. I have followed the guide to train my ToolLlama model, and I want to run an evaluation on a given dataset, for example G1 Instruction. However, the average inference time per query is over 1 minute, due to the expensive cost of DFS and API requests, which implies that I will have to endure nearly 1500 hours of waiting! Do you have any suggestions for reducing the inference time?

pooruss commented 1 year ago

Hi, thank you for bringing up these questions.

  1. To maintain fairness, we cached all API responses during our preliminary experiments and evaluations. However, since the APIs are real-world services subject to factors such as updates and server state, unstable and unpredictable responses can occur when a request is not covered by the cached data. We acknowledge that evaluating a model's ability to handle such real-world scenarios with absolute fairness is challenging. As a compromise, we could mock the environment by simulating API responses with GPT whenever a request is not covered by the cache (see the sketch after this list). We will consider this for future development, and we are open to collaborating on it.
  2. It is possible that you have mixed up the test set with the train set. Our benchmark currently consists of 600 test queries, so at 1 minute per query the entire evaluation should take approximately 600 minutes (about 10 hours), not 1500 hours. Please also note that there is a rate limit of 30 requests per second for every ToolBench key.
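
The cache-with-fallback behavior described in point 1 can be illustrated with a short sketch. This is a minimal illustration of the idea, not ToolBench's actual implementation: `real_call` and `simulate` stand in for a live RapidAPI request and a hypothetical GPT-based response simulator.

```python
import hashlib
import json

def _cache_key(api_name: str, params: dict) -> str:
    # Stable key: API name plus a hash of the canonicalized parameters.
    payload = json.dumps(params, sort_keys=True)
    return api_name + ":" + hashlib.sha256(payload.encode()).hexdigest()

def cached_call(api_name, params, cache, real_call, simulate):
    """Serve from cache when possible; otherwise hit the live API,
    falling back to a simulated response if the API is unavailable."""
    key = _cache_key(api_name, params)
    if key in cache:
        return cache[key]          # covered by cached data: deterministic
    try:
        response = real_call(api_name, params)   # live RapidAPI request
    except Exception:
        response = simulate(api_name, params)    # e.g. GPT-mocked response
    cache[key] = response          # extend the cache for future runs
    return response
```

Requests covered by the cache are fully deterministic across evaluation runs; only uncached requests are exposed to live API instability.
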
realgump commented 1 year ago

Thanks for your quick reply.

  1. How can I enable the cache feature for API requests?
  2. I use data/instruction/G1_query.json to perform evaluation; how do I split the queries into a train set and a test set?

pooruss commented 1 year ago
  1. The data cache is already maintained on our server. If you are requesting RapidAPI through our server, the cache is already enabled.
  2. The directory data/test_query_ids contains the query ids of the test instances in each test set. You can extract every test set from data/instruction/G1_query.json by matching those query ids; a minimal sketch follows below.
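
As a rough illustration of point 2, the snippet below filters the full query file down to one test set. It assumes the id file is a JSON list of query ids and that each query record carries a "query_id" field; the paths and field names are illustrative, so adjust them to the actual file layout in the repo.

```python
import json

# Hypothetical paths and field names; check them against the repo layout.
with open("data/test_query_ids/G1_instruction.json") as f:
    test_ids = set(json.load(f))

with open("data/instruction/G1_query.json") as f:
    all_queries = json.load(f)

# Keep only the queries whose id appears in the test-id file.
test_set = [q for q in all_queries if q.get("query_id") in test_ids]
print(f"Extracted {len(test_set)} test queries")
```
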