THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

[Feature] Which metric is the final KG score? I see you report three metrics: F1, Exact Match, and Executability. Are they combined with weights? I did not find any weighting formula. #148

Closed by minleminzui 4 months ago

minleminzui commented 4 months ago

We use the F1 score as the primary evaluation metric in our study, calculated by comparing the model's predicted answers to the gold-standard answers. In addition to F1, we also report Exact Match. However, unlike previous studies that measure Exact Match on the logical form, we assess it as an exact match between the predicted and gold answer sets. Lastly, we evaluate the Executability of the action sequences generated by the model: if the model's action sequence produces any set of answers when executed, it scores 1.0 for Executability; if it fails to produce an answer, it scores 0.0.
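
For concreteness, here is a minimal sketch of the three metrics as described above, assuming the predicted and gold answers are plain sets of entity strings. The function names and the treatment of an empty answer set are illustrative assumptions, not AgentBench's actual implementation.

```python
# Illustrative sketch of the KG metrics; assumes answers are sets of
# entity strings. Not the actual AgentBench code.

def f1_score(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 between the predicted and gold answer sets."""
    if not predicted or not gold:
        return 0.0
    overlap = predicted & gold
    if not overlap:
        return 0.0
    precision = len(overlap) / len(predicted)
    recall = len(overlap) / len(gold)
    return 2 * precision * recall / (precision + recall)

def exact_match(predicted: set[str], gold: set[str]) -> float:
    """1.0 iff the predicted answer set equals the gold set exactly
    (set equality, not logical-form equality)."""
    return 1.0 if predicted == gold else 0.0

def executability(predicted: set[str] | None) -> float:
    """1.0 if executing the action sequence produced any answers;
    0.0 otherwise (assumption: an empty result counts as a failure)."""
    return 1.0 if predicted else 0.0


# Example: a partially correct prediction.
pred = {"Barack Obama", "Michelle Obama"}
gold = {"Barack Obama"}
print(f1_score(pred, gold))     # 0.666... (precision 0.5, recall 1.0)
print(exact_match(pred, gold))  # 0.0
print(executability(pred))      # 1.0
```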

minleminzui commented 4 months ago

Or is only the F1 score considered?

minleminzui commented 4 months ago

We adopt question answering as the basic task formulation, and consequently use answer F1 as the metric.