ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

[BFCL] Are there any dependencies between 'bfcl generate' and 'bfcl evaluate' #734

Open TurboMa opened 4 weeks ago

TurboMa commented 4 weeks ago

Describe the issue
This is actually not an issue but a simple question: I have run `bfcl generate` on llama3.1 with the python_ast test category and got a list of results from the model. Next, when I run `bfcl evaluate` to get the score, will it directly compare the model generations with the standard answers, or will it run the model again to produce new answers (i.e., re-run generation inside evaluation)?

HuanzhiMao commented 3 weeks ago

`bfcl generate` generates the model responses; `bfcl evaluate` takes the output from `bfcl generate` and compares it with the ground truth. `bfcl evaluate` will not run the generation again.
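For reference, a minimal two-step invocation might look like the sketch below. The model name is just a placeholder, and the exact flag names can vary between BFCL versions, so check `bfcl generate --help` for your install:

```bash
# Step 1: generate and save the model responses for the python_ast test categories
bfcl generate --model meta-llama/Llama-3.1-8B-Instruct --test-category python_ast

# Step 2: score the saved responses against the ground truth (no re-generation happens here)
bfcl evaluate --model meta-llama/Llama-3.1-8B-Instruct --test-category python_ast
```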

TurboMa commented 3 weeks ago


Thanks @HuanzhiMao for the reply. I ran it (although there was a problem when generating the CSV) and got the accuracy for each dataset (I ran the python_ast test category). All the subsets more or less match the leaderboard accuracy except BFCL_V3_simple_score.json (I got 24.5, while the leaderboard shows 49.58). Any ideas? Thanks

HuanzhiMao commented 3 weeks ago


The simple category on the leaderboard is an unweighted average of BFCL_V3_simple, BFCL_V3_java, and BFCL_V3_javascript.
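So the leaderboard's Simple column is not the number in BFCL_V3_simple_score.json alone. A minimal sketch of that aggregation, with placeholder per-subset accuracies (only the 24.5 comes from your run; the Java and JavaScript values below are made up for illustration), looks like this:

```python
# Per-subset accuracies as reported in the *_score.json files.
# Only BFCL_V3_simple (24.5) is a real number from the run above;
# the other two are placeholders, not actual leaderboard values.
subset_accuracies = {
    "BFCL_V3_simple": 24.5,
    "BFCL_V3_java": 70.0,        # placeholder
    "BFCL_V3_javascript": 65.0,  # placeholder
}

# The leaderboard "Simple" column is the unweighted mean of the three subsets.
leaderboard_simple = sum(subset_accuracies.values()) / len(subset_accuracies)
print(f"Leaderboard simple score: {leaderboard_simple:.2f}")
```

That averaging is why a local BFCL_V3_simple_score.json accuracy can sit well below (or above) the single Simple number shown on the leaderboard.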