ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

Clarify Documentation About Running The Benchmark #502

Open hamelsmu opened 2 months ago

hamelsmu commented 2 months ago

I had trouble following the instructions for running the benchmark:

  1. Why do we have to apply the credentials one file at a time? Don't all the files get concatenated by eval_data_compilation.py? (See the first sketch after this list.)

  2. The TEST_CATEGORY doesn't seem to mean the same thing in openfunctions_evaluation.py and eval_checker/eval_runner.py. Why are the test categories different?

  3. Should the MODEL_NAME be the same between openfunctions_evaluation.py and eval_checker/eval_runner.py, or is there a reason you might want them to be different? (See the second sketch after this list.)
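
For the first question, here is a minimal sketch of what I expected to be able to do: apply credentials to every test file in one pass instead of one file at a time. The data directory, file-name glob, and placeholder tokens below are assumptions about the repo layout, not the repo's actual API:

```python
# Hypothetical helper: substitute real API keys into every test file at once.
# The directory, glob pattern, and placeholder strings are assumptions.
from pathlib import Path

CREDENTIALS = {
    "YOUR-OMDB-API-KEY": "<real-omdb-key>",       # assumed placeholder token
    "YOUR-RAPID-API-KEY": "<real-rapidapi-key>",  # assumed placeholder token
}

def apply_credentials(data_dir: str = "data") -> None:
    # Walk every test file and do a plain text substitution of the placeholders.
    for path in Path(data_dir).glob("gorilla_openfunctions_v1_test_*.json"):
        text = path.read_text()
        for placeholder, real_key in CREDENTIALS.items():
            text = text.replace(placeholder, real_key)
        path.write_text(text)
        print(f"Applied credentials to {path.name}")

if __name__ == "__main__":
    apply_credentials()
```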
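For questions 2 and 3, what I expected is to set MODEL_NAME and TEST_CATEGORY once and reuse them for both stages, roughly like the sketch below. The `--model` and `--test-category` flag names and the example values are assumptions from my reading of the README, not a verified invocation:

```python
# A minimal sketch of keeping the two stages consistent: define MODEL_NAME and
# TEST_CATEGORY once and pass them to both scripts. Flag names are assumptions.
import subprocess

MODEL_NAME = "gorilla-openfunctions-v2"  # assumed example model name
TEST_CATEGORY = "simple"                 # assumed example category

# Stage 1: generate the model's responses for the chosen category.
subprocess.run(
    ["python", "openfunctions_evaluation.py",
     "--model", MODEL_NAME, "--test-category", TEST_CATEGORY],
    check=True,
)

# Stage 2: score the generated responses with the eval checker.
subprocess.run(
    ["python", "eval_checker/eval_runner.py",
     "--model", MODEL_NAME, "--test-category", TEST_CATEGORY],
    check=True,
)
```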

It would be helpful to describe more fully what each step is doing and why, if possible. A related question: the OMDB API is really flaky in my experience. I intermittently get 401 errors for no reason, even on the same request (sometimes 200, sometimes 401), but I couldn't figure out how to turn off the test cases that involve this API, so the run fails. Understanding what each step is meant to do in more detail would help me run the benchmark. (A filtering sketch follows below.)
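
The best workaround I could come up with for the OMDB flakiness is pre-filtering the executable test file to drop entries that appear to call OMDB-backed functions, roughly like this sketch. The file names and the keyword heuristic are assumptions about the data format:

```python
# Hypothetical workaround: drop test entries that appear to call OMDB-backed
# functions before running the executable categories. The file names and the
# "omdb" keyword heuristic are assumptions about the data format.
import json

def strip_omdb_cases(test_file: str, output_file: str) -> None:
    kept = []
    with open(test_file) as f:
        for line in f:
            entry = json.loads(line)
            # Assumed heuristic: OMDB-related entries mention "omdb" somewhere
            # in the serialized test case; adjust to the real schema.
            if "omdb" in line.lower():
                continue
            kept.append(entry)
    with open(output_file, "w") as f:
        for entry in kept:
            f.write(json.dumps(entry) + "\n")

strip_omdb_cases(
    "gorilla_openfunctions_v1_test_executable_simple.json",           # assumed name
    "gorilla_openfunctions_v1_test_executable_simple_filtered.json",  # assumed name
)
```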

cc: @HuanzhiMao

HuanzhiMao commented 1 month ago

Hi @hamelsmu,

Thank you for raising this issue. We have addressed all the concerns you mentioned. Sorry for the delay in the merging process.