Why do we have to apply the credentials one file at a time? Don't all the files get concatenated by eval_data_compilation.py?
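(For anyone else stuck here: until this is clarified, applying credentials to many test files can be scripted. The helper below is a hypothetical sketch — the placeholder token `YOUR_API_KEY`, the file layout, and the function name are my assumptions, not the repo's actual convention.)

```python
from pathlib import Path

def apply_credentials(data_dir: str, replacements: dict) -> int:
    """Replace placeholder tokens (e.g. 'YOUR_API_KEY') with real
    credentials in every .json test file under data_dir.

    Returns the number of files that were actually modified."""
    modified = 0
    for path in Path(data_dir).glob("*.json"):
        text = path.read_text()
        new_text = text
        for placeholder, secret in replacements.items():
            new_text = new_text.replace(placeholder, secret)
        if new_text != text:
            path.write_text(new_text)
            modified += 1
    return modified
```

Run once against the data directory before compilation, e.g. `apply_credentials("data", {"YOUR_API_KEY": "..."})`.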
Why is TEST_CATEGORY not the same between openfunctions_evaluation.py and eval_checker/eval_runner.py? Why are the test categories different?
Should the MODEL_NAME be the same between openfunctions_evaluation.py and eval_checker/eval_runner.py, or is there a reason you might want them to be different?
It would be helpful to describe more fully what each step is doing, and why, if possible. A related question: the OMDB API is really flaky in my experience; I intermittently get 401 errors for no reason, even on the same request (sometimes 200, sometimes 401). I couldn't figure out how to turn off the test cases that involve this API, and it caused things to fail. Understanding what each step is meant to do in more detail would help me run the benchmark!
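(A generic workaround for the intermittent 401s, while waiting for a fix upstream: retry the request with exponential backoff. This is a standalone sketch, not code from the benchmark — `ApiError` and `call_with_retry` are hypothetical names.)

```python
import time

class ApiError(Exception):
    """Hypothetical error carrying the HTTP status of a failed call."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retry(fn, retries=3, backoff=0.5, transient=(401,)):
    """Call fn(); if it raises ApiError with a transient status
    (e.g. an intermittent 401), sleep and retry with exponential
    backoff. Re-raise on non-transient errors or when retries
    are exhausted."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except ApiError as e:
            if e.status not in transient or attempt == retries:
                raise
            time.sleep(backoff * (2 ** attempt))
```

Wrapping the flaky OMDB call in `call_with_retry` turns "sometimes 200, sometimes 401" into a usually-successful call at the cost of a short delay.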
Thank you for raising this issue. We have addressed all the concerns you mentioned. Sorry for the delay in the merging process.
PR #508 and #512 address your first concern.
PR #506 addresses your second concern.
PR #439 addresses your third concern.
Regarding API reliability, we are monitoring the endpoints closely and will consider replacements for any that remain unreliable. Currently, they are all functioning properly. In addition, the API sanity check is now disabled by default as of PR #496.
I had trouble following the instructions for running the benchmark
cc: @HuanzhiMao