Hi, I found a bug in `benchmark_eval.py`. Below is the original code of `get_tool_metric`:
```python
def get_tool_metric(golden_toolnames, golden_tool_args, tool_name, tool_args):
    tool_metrics = []
    for golden_toolname, golden_tool_arg in zip(golden_toolnames, golden_tool_args):
        if golden_toolname == 'None':
            continue
        tool_em = 1 if tool_name == golden_toolname else 0
        avg_arg_rouges = []
        if golden_tool_arg == {} and tool_args == {}:
            avg_arg_rouges = [1.]
        elif tool_args != {}:
            for k, v in golden_tool_arg.items():
                for k1, v1 in tool_args.items():
                    if k1 == k:
                        avg_arg_rouges.append(calculate_rouge_l(v, v1))
                        break
                avg_arg_rouges.append(0.)
        else:
            avg_arg_rouges = [0.]
        arg_rouge = sum(avg_arg_rouges) / len(avg_arg_rouges) if len(avg_arg_rouges) > 0 else 0
        tool_metrics.append(arg_rouge * tool_em)
    if len(tool_metrics) == 0:
        tool_metrics = [0.]
    return max(tool_metrics)
```
When `tool_args != {}`, the logic is to find a matched key between `tool_args` and `golden_tool_arg`. If there is a match, i.e. `k1 == k`, we calculate the rouge_l score and append it to the list `avg_arg_rouges`; if there is no matched key, a score of 0 is appended to `avg_arg_rouges`. However, when there is a match, two values are actually appended to `avg_arg_rouges`: one is the rouge_l score and the other is 0, because the `break` only exits the inner loop and the unconditional `avg_arg_rouges.append(0.)` still runs.
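For example, running just that matching loop in isolation shows the extra zero. This is a minimal sketch; `calculate_rouge_l` is stubbed here as a hypothetical exact-match scorer purely for illustration:

```python
# Hypothetical stand-in for calculate_rouge_l, only to illustrate the bug.
def calculate_rouge_l(reference, hypothesis):
    return 1. if reference == hypothesis else 0.

golden_tool_arg = {"query": "weather in Paris"}
tool_args = {"query": "weather in Paris"}

avg_arg_rouges = []
for k, v in golden_tool_arg.items():
    for k1, v1 in tool_args.items():
        if k1 == k:
            avg_arg_rouges.append(calculate_rouge_l(v, v1))
            break
    # break only exits the inner loop, so this still runs after a match
    avg_arg_rouges.append(0.)

print(avg_arg_rouges)  # [1.0, 0.0] -> average is 0.5 instead of 1.0
```

So even a perfectly matched argument is scored 0.5 rather than 1.0.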
A revised version should be as follows:
```python
def get_tool_metric(golden_toolnames, golden_tool_args, tool_name, tool_args):
    tool_metrics = []
    for golden_toolname, golden_tool_arg in zip(golden_toolnames, golden_tool_args):
        if golden_toolname == 'None':
            continue
        tool_em = 1 if tool_name == golden_toolname else 0
        avg_arg_rouges = []
        if golden_tool_arg == {} and tool_args == {}:
            avg_arg_rouges = [1.]
        elif tool_args != {}:
            # original implementation
            # for k, v in golden_tool_arg.items():
            #     for k1, v1 in tool_args.items():
            #         if k1 == k:
            #             avg_arg_rouges.append(calculate_rouge_l(v, v1))
            #             break
            #     avg_arg_rouges.append(0.)
            # new implementation
            for k, v in golden_tool_arg.items():
                match_k = False
                for k1, v1 in tool_args.items():
                    if k1 == k:
                        avg_arg_rouges.append(calculate_rouge_l(v, v1))
                        match_k = True
                        break
                if not match_k:
                    avg_arg_rouges.append(0.)
        else:
            avg_arg_rouges = [0.]
        arg_rouge = sum(avg_arg_rouges) / len(avg_arg_rouges) if len(avg_arg_rouges) > 0 else 0
        tool_metrics.append(arg_rouge * tool_em)
    if len(tool_metrics) == 0:
        tool_metrics = [0.]
    return max(tool_metrics)
```
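As a quick sanity check (again with a hypothetical exact-match stub in place of the real `calculate_rouge_l`), the revised function appends exactly one score per golden argument key:

```python
# Hypothetical exact-match stub for calculate_rouge_l, just to exercise the metric.
def calculate_rouge_l(reference, hypothesis):
    return 1. if reference == hypothesis else 0.

golden_toolnames = ["search"]
golden_tool_args = [{"query": "weather in Paris"}]

# A perfect prediction: the revised code returns 1.0, while the original returned 0.5.
print(get_tool_metric(golden_toolnames, golden_tool_args, "search", {"query": "weather in Paris"}))
```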