coo00ookie closed this issue 8 months ago
Thank you so much for raising this issue! However, I find that the original code works well for me. For example, suppose we have 2 predictions:
pred1= "True [fact: False]"
gt1="True [fact: True]"
pred2="False [fact: True]"
gt2="True [fact: False]"
After processing, the predictions and references look like:
predictions = [True, False, False, True]
references = [[True], [True], [True], [False]]
The result is correct. Did I misunderstand you here? Please feel free to let me know and give more details about your case~
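The flattening described above can be sketched as follows. This is only a minimal illustration: `parse_judgement` and `flatten` are hypothetical helpers written for this example, not the repository's `process_output_judge` or `process_gt_judge`, and the sketch assumes every string has exactly the form `<loyalty> [fact: <fact>]`.

```python
import re

# Hypothetical pattern for strings such as "True [fact: False]".
# The real parsing in auto_eval.py may differ.
_JUDGEMENT = re.compile(r"^\s*(True|False)\s*\[fact:\s*(True|False)\s*\]\s*$")


def parse_judgement(text):
    """Return (loyalty, fact) as booleans for one judgement string."""
    match = _JUDGEMENT.match(text)
    if match is None:
        raise ValueError(f"unrecognized judgement: {text!r}")
    loyalty, fact = (value == "True" for value in match.groups())
    return loyalty, fact


def flatten(preds, gts):
    """Turn each (loyalty, fact) pair into two consecutive entries."""
    predictions, references = [], []
    for pred, gt in zip(preds, gts):
        predictions.extend(parse_judgement(pred))
        # Each reference is wrapped in its own list, as in the thread.
        references.extend([value] for value in parse_judgement(gt))
    return predictions, references


predictions, references = flatten(
    ["True [fact: False]", "False [fact: True]"],
    ["True [fact: True]", "True [fact: False]"],
)
print(predictions)  # [True, False, False, True]
print(references)   # [[True], [True], [True], [False]]
```

With this ordering, even indices hold loyalty answers and odd indices hold fact answers, so a pair can always be recovered from its position.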
In my case, if I have predictions and references like these:
pred1= "True [fact: False]"
gt1="True [fact: True]"
pred2="False [fact: True]"
gt2="True [fact: False]"
After processing, they became the following:
predictions = [True, True, False, False]
references = [[True], [True], [True], [False]]
I found that this is caused by process_output_judge, especially for the second loyalty value.
More fundamentally, I have a question about the scoring logic, even if process_output_judge has no errors in its code.
Currently, each question has two answers (loyalty and fact). But according to @ChenxinAn-fdu's example, they will be scored as two questions.
Apologies for the delayed response. We calculate the loyalty score according to the fiction, and fact score according to real-world knowledge separately, treating them as two distinct questions. You can also calculate these scores individually. If you want to report the average of both the loyalty score and fact score, the results are consistent.
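To illustrate why the two ways of reporting agree, here is a small sketch. It assumes the metric is plain exact-match accuracy and that loyalty and fact entries alternate in the flattened lists; the `accuracy` helper is hypothetical and is not taken from auto_eval.py.

```python
def accuracy(predictions, references):
    """Fraction of predictions found in their reference list."""
    hits = sum(pred in ref for pred, ref in zip(predictions, references))
    return hits / len(predictions)


# Flattened lists from the example earlier in this thread:
# even indices are loyalty answers, odd indices are fact answers.
predictions = [True, False, False, True]
references = [[True], [True], [True], [False]]

loyalty_acc = accuracy(predictions[0::2], references[0::2])
fact_acc = accuracy(predictions[1::2], references[1::2])
flat_acc = accuracy(predictions, references)

print(loyalty_acc)                   # 0.5
print(fact_acc)                      # 0.0
print(flat_acc)                      # 0.25
print((loyalty_acc + fact_acc) / 2)  # 0.25
```

The agreement holds because every question contributes exactly one loyalty answer and one fact answer, so both subsets have the same size.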
I have updated auto_eval.py to fix some bugs in testing sci_fi. I think it can help solve your problem ^-^. @coo00ookie
I apologize for my delayed reply.
I've also checked your modifications in auto_eval.py. Thanks!
Thank you again for opening this issue. If there are any other issues with this code, please feel free to let me know~
Hi, I encountered some issues while evaluating some models with your evaluation code, especially in Evaluation/auto_eval.py.
I'd like to send a PR, but that is currently not available, so please take a look. Thank you in advance.
Also, in the same file, in the process_gt_judge function, I think that changing from **.** to **[^\\]** would be clearer, because . does not match the newline character.
The last one is in the main code: to make loyalty and fact one pair, I think it is supposed to be written like below: