TurboMa opened this issue 4 weeks ago
`bfcl generate` generates the model responses; `bfcl evaluate` takes the output from `bfcl generate` and compares it with the ground truth. `bfcl evaluate` will not run generation again.
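In practice the two steps look roughly like the sketch below. This is only a sketch based on the flags shown in the BFCL README (`--model` and `--test-category`); `MODEL_NAME` is a placeholder for whatever model handle you used during generation.

```bash
# Step 1: run generation once; the model responses are saved to the result files.
bfcl generate --model MODEL_NAME --test-category python_ast

# Step 2: score the saved responses against the ground truth.
# This step only reads the files produced by `bfcl generate`; it does not call the model again.
bfcl evaluate --model MODEL_NAME --test-category python_ast
```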
Thanks @HuanzhiMao for the reply. I ran it (although there was a problem when generating the CSV) and got the accuracy for each dataset (I ran the python_ast test category). All the subsets give more or less the same accuracy as the leaderboard, except BFCL_V3_simple_score.json (I got 24.5, while the leaderboard shows 49.58). Any ideas? Thanks
The `simple` category on the leaderboard is an unweighted average of `BFCL_V3_simple`, `BFCL_V3_java`, and `BFCL_V3_javascript`.
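So the score from BFCL_V3_simple_score.json alone is not expected to match the leaderboard column. As a quick sanity check, the column is just the plain mean of the three per-file accuracies. The java and javascript values below are hypothetical placeholders; substitute the numbers from your own score files.

```bash
# 0.245 is the simple accuracy from the comment above; 0.600 and 0.550 are made-up
# placeholders for java and javascript. The leaderboard "simple" column is their unweighted mean.
echo "scale=4; (0.245 + 0.600 + 0.550) / 3" | bc
# -> .4650
```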
Describe the issue This is not actually an issue but a simple question: I have run `bfcl generate` on Llama 3.1 with the python_ast test category and got a list of results from the model. Next, if I want to get the score and run `bfcl evaluate`, will it directly compare the model generations with the ground-truth answers, or will it run the model again to produce new answers (i.e., rerun generation inside evaluation)?