KwaiKEG / KwaiAgents

A generalized information-seeking agent system with Large Language Models (LLMs).

Bug in benchmark_eval.py? #15

Closed: cteant closed this issue 10 months ago

cteant commented 10 months ago

Hi, I found a bug in benchmark_eval.py. Below is the original code of get_tool_metric:

def get_tool_metric(golden_toolnames, golden_tool_args, tool_name, tool_args):
    tool_metrics = []
    for golden_toolname, golden_tool_arg in zip(golden_toolnames, golden_tool_args):
        if golden_toolname == 'None':
            continue
        tool_em = 1 if tool_name == golden_toolname else 0
        avg_arg_rouges = []
        if golden_tool_arg == {} and tool_args == {}:
            avg_arg_rouges = [1.]
        elif tool_args != {}:
            for k,v in golden_tool_arg.items():
                for k1,v1 in tool_args.items():
                    if k1 == k:
                        avg_arg_rouges.append(calculate_rouge_l(v, v1))
                        break
                avg_arg_rouges.append(0.)
        else:
            avg_arg_rouges = [0.]
        arg_rouge = sum(avg_arg_rouges) / len(avg_arg_rouges) if len(avg_arg_rouges)>0 else 0 
        tool_metrics.append(arg_rouge * tool_em)

    if len(tool_metrics) == 0:
        tool_metrics = [0.]
    return max(tool_metrics)

When tool_args != {}, the intended logic is to look for a matching key between tool_args and golden_tool_arg: if there is a match, i.e. k1 == k, we compute calculate_rouge_l(v, v1) and append it to avg_arg_rouges, and if no key matches, a 0 score is appended instead. However, the avg_arg_rouges.append(0.) sits outside the inner loop, so it runs unconditionally: when there is a match, two values are actually appended to avg_arg_rouges, the ROUGE-L score and an extra 0. This dilutes the argument score for every correctly predicted key.
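
For example, here is a minimal reproduction of the faulty loop with hypothetical inputs (calculate_rouge_l is stubbed with exact-match scoring just to make the list contents obvious):

def calculate_rouge_l(reference, candidate):
    # stand-in for the real ROUGE-L in benchmark_eval.py: exact match -> 1, else 0
    return 1.0 if reference == candidate else 0.0

golden_tool_arg = {"query": "weather in Beijing"}  # hypothetical golden arguments
tool_args = {"query": "weather in Beijing"}        # hypothetical predicted arguments

avg_arg_rouges = []
for k, v in golden_tool_arg.items():
    for k1, v1 in tool_args.items():
        if k1 == k:
            avg_arg_rouges.append(calculate_rouge_l(v, v1))
            break
    avg_arg_rouges.append(0.)  # runs even when a match was found above

print(avg_arg_rouges)                             # [1.0, 0.0]
print(sum(avg_arg_rouges) / len(avg_arg_rouges))  # 0.5, should be 1.0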

A revised version should be as follows:

def get_tool_metric(golden_toolnames, golden_tool_args, tool_name, tool_args):
    tool_metrics = []
    for golden_toolname, golden_tool_arg in zip(golden_toolnames, golden_tool_args):
        if golden_toolname == 'None':
            continue
        tool_em = 1 if tool_name == golden_toolname else 0
        avg_arg_rouges = []
        if golden_tool_arg == {} and tool_args == {}:
            avg_arg_rouges = [1.]
        elif tool_args != {}:
            # original implementation
            # for k,v in golden_tool_arg.items():
            #     for k1,v1 in tool_args.items():
            #         if k1 == k:
            #             avg_arg_rouges.append(calculate_rouge_l(v, v1))
            #             break
            #     avg_arg_rouges.append(0.)

            # new implementation
            for k,v in golden_tool_arg.items():
                match_k = False
                for k1,v1 in tool_args.items():
                    if k1 == k:
                        avg_arg_rouges.append(calculate_rouge_l(v, v1))
                        match_k = True
                        break
                if not match_k:
                    avg_arg_rouges.append(0.)
        else:
            avg_arg_rouges = [0.]
        arg_rouge = sum(avg_arg_rouges) / len(avg_arg_rouges) if len(avg_arg_rouges)>0 else 0 
        tool_metrics.append(arg_rouge * tool_em)

    if len(tool_metrics) == 0:
        tool_metrics = [0.]
    return max(tool_metrics)
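
With this fix, the hypothetical example above appends only the matched ROUGE-L score, so a fully matched argument dict yields an argument score of 1.0 instead of 0.5.
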
cteant commented 10 months ago

It would be great if you could update the scores using the new implementation. Thanks for the nice work :)