OpenLMLab / LEval

[ACL'24 Oral] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark

There seem to be some mistakes in the evaluation tasks #8

Closed coo00ookie closed 8 months ago

coo00ookie commented 8 months ago

Hi, I encountered some issues while evaluating models with your evaluation code, especially in Evaluation/auto_eval.py. I'd like to send a PR, but that is currently not possible, so please take a look. Thank you in advance.

def process_output_judge(response):
    loyalty, fact = process_gt_judge(response)
    output_list = []
    for word in loyalty.split():
        if "true" in word:
            output_list.append("true")
            break
        elif "false" in word:
            output_list.append("false")
            break
    if len(output_list) == 0:
        output_list.append("<error>")
    for word in loyalty.split():  # <<--- I think this second loyalty has to be modified to fact
        if "true" in word:
            output_list.append("true")
            break
        elif "false" in word:
            output_list.append("false")
            break
    if output_list == 1:
        output_list.append("<error>")
    return output_list # disable random guess for the dataset for high variance
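
For clarity, here is what I mean: the same function with the second loop iterating over fact instead of loyalty (and, I assume, len(output_list) == 1 in the second error check, which seems to be the intent):

def process_output_judge(response):
    # Split the model output into its loyalty part and its fact part.
    loyalty, fact = process_gt_judge(response)
    output_list = []
    # First label: loyalty to the fiction.
    for word in loyalty.split():
        if "true" in word:
            output_list.append("true")
            break
        elif "false" in word:
            output_list.append("false")
            break
    if len(output_list) == 0:
        output_list.append("<error>")
    # Second label: consistency with real-world facts (iterate over fact, not loyalty).
    for word in fact.split():
        if "true" in word:
            output_list.append("true")
            break
        elif "false" in word:
            output_list.append("false")
            break
    if len(output_list) == 1:
        output_list.append("<error>")
    return output_list  # always [loyalty_label, fact_label]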

And in the same file, in the process_gt_judge function:

    match = re.search(r'\[fact: (.*?)\]', response) 

I thought that changing `.` to a negated character class like `[^\]]` would be clearer, because `.` does not match the newline character.
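
A quick check of the difference (just an illustration, not the repo code):

    import re

    response = "True [fact: the moon\nis not made of cheese]"

    # Original pattern: `.` does not match a newline, so a multi-line fact is missed.
    print(re.search(r'\[fact: (.*?)\]', response))                    # None

    # Negated character class: matches anything except ']', including newlines.
    print(re.search(r'\[fact: ([^\]]*?)\]', response).group(1))

    # An equivalent alternative: keep `.` but pass the DOTALL flag.
    print(re.search(r'\[fact: (.*?)\]', response, re.DOTALL).group(1))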

And the last one is in the main code:

        elif "sci_fi" in args.pred_file:
            loyalty, fact = process_gt_judge(instance["gt"])
            references += [[loyalty], [fact]]
            predictions +=  process_output_judge(instance[prediction_key])

To keep loyalty and fact together as one pair, I think it is supposed to be written like below:

            references.append([[loyalty], [fact]])
            predictions.append(process_output_judge(instance[prediction_key]))
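
In other words, the difference comes down to += (extend) versus append on a Python list:

    loyalty, fact = "true", "false"

    references_flat = []
    references_flat += [[loyalty], [fact]]         # current code: two separate scoring entries
    print(references_flat)                          # [['true'], ['false']]

    references_paired = []
    references_paired.append([[loyalty], [fact]])  # proposed: one entry per sample
    print(references_paired)                        # [[['true'], ['false']]]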
ChenxinAn-fdu commented 8 months ago

Thank you so much for this issue! But I find the original code works well for me. For example, if we have two predictions:

pred1= "True [fact: False]" 
gt1="True [fact: True]"
pred2="False [fact: True]"
gt2="True [fact: False]"

After processing, the predictions and references look like:

predictions = [True, False, False, True]
references = [[True], [True], [True], [False]]

The result is correct. Did I get you wrong here? Please feel free to let me know and give more details about your case~

coo00ookie commented 8 months ago

In my case, if I have predictions and references like this:

pred1= "True [fact: False]" 
gt1="True [fact: True]"
pred2="False [fact: True]"
gt2="True [fact: False]"

After processing, they became like below:

predictions = [True, True, False, False]
references = [[True], [True], [True], [False]]

I found this is caused by process_output_judge, specifically the second loyalty loop: because both loops iterate over loyalty, the fact part of each prediction is ignored and the loyalty label is duplicated.

And basically, I have a question about the scoring logic even if process_output_judge has no error in its code. For now, each question has two answers (loyalty and fact), but according to @ChenxinAn-fdu's example, they will be scored as two separate questions.

ChenxinAn-fdu commented 8 months ago

Apologies for the delayed response. We calculate the loyalty score against the fiction and the fact score against real-world knowledge separately, treating them as two distinct questions. You can also calculate these scores individually. If you want to report the average of the loyalty score and the fact score, the results are consistent.
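
As a toy illustration of why the results are consistent: every sample contributes exactly one loyalty item and one fact item, so the score over the flattened list equals the average of the two separate scores.

    # Toy correctness flags for 3 samples (1 = correct, 0 = wrong).
    loyalty_correct = [1, 0, 1]
    fact_correct    = [1, 1, 0]

    loyalty_acc = sum(loyalty_correct) / len(loyalty_correct)   # 2/3
    fact_acc    = sum(fact_correct) / len(fact_correct)         # 2/3

    # Scoring the flattened list ...
    pooled_acc = (sum(loyalty_correct) + sum(fact_correct)) / (len(loyalty_correct) + len(fact_correct))

    # ... matches the average of the two separate scores.
    print(pooled_acc, (loyalty_acc + fact_acc) / 2)             # both 0.666...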

I have updated auto_eval.py to fix some bugs in testing sci_fi. I think it can help solve your problem ^-^. @coo00ookie

coo00ookie commented 8 months ago

I apologize for my delayed reply. Also, I've checked your modifications to auto_eval.py. Thanks!

ChenxinAn-fdu commented 8 months ago

Thank you again for opening this issue. If there are any other issues with this code, please feel free to let me know~