realgump closed this issue 1 year ago
Hi, thank you for bringing up these questions.
Thanks for your quick reply.
1. How can I enable the cache feature for API requests?
2. I use data/instruction/G1_query.json to perform evaluation. How should I split the queries into a train set and a test set?
data/test_query_ids contains the query IDs of the test instances in each test set. You can extract each test set from data/instruction/G1_query.json by matching the query IDs.
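In case it helps, here is a minimal sketch of that matching step. The test-id file name G1_instruction.json and the query_id field are my assumptions about the repo layout, so adjust them to the actual files in your checkout:

```python
import json

# Load all queries for the group (path from the answer above).
with open("data/instruction/G1_query.json") as f:
    queries = json.load(f)

# Load the test query IDs for one test set. The exact file name under
# data/test_query_ids is an assumption; pick the file for your test set.
with open("data/test_query_ids/G1_instruction.json") as f:
    test_ids = {str(i) for i in json.load(f)}

# Match on query_id: anything listed in the test-id file is test data,
# the rest can serve as the train split. str() guards against the IDs
# being stored as ints in one file and strings in the other.
test_set = [q for q in queries if str(q["query_id"]) in test_ids]
train_set = [q for q in queries if str(q["query_id"]) not in test_ids]

print(f"train: {len(train_set)}, test: {len(test_set)}")
```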
Hello, thanks for your work on ToolLlama and ToolEval. However, I have some questions about the ToolEval benchmark.
I noticed that some APIs may be invalid, which has also been mentioned in other issues, such as #53. Does that mean that inference on the dataset will be affected by the health of the APIs? How can we ensure that the evaluation is fair?
I have followed the guide to train my ToolLlama model, and I want to run an evaluation on a given dataset, for example, G1 Instruction. However, the average inference time per query is over 1 minute, due to the high cost of DFS and API requests, which implies I would have to wait nearly 1,500 hours! Do you have any suggestions for reducing the inference time?