artidoro / frank

FRANK: Factuality Evaluation Benchmark

Questions on scoring in human_annotations.json #5

Open sonsus opened 2 years ago

sonsus commented 2 years ago

Hi, I'm happy to have found such a good benchmark for summarization metrics. I have a few questions after going through the code and the paper.

As far as I understand, human_annotations.json contains scores that summarize human_annotations_sentence.json.

  1. (As a sanity check of my understanding) For each sentence, the major error type is taken as that sentence's final label, and that label determines the scores (e.g., NoE counts toward Factuality, LinkE counts toward Discourse Errors). So if an article's summary has one NoE sentence and two LinkE sentences, it should be scored as {Factuality: 0.333, Discourse_Errors: 0.333}, while Semantic_Frame_Errors and Content_Verifiability_Errors are 1.0, meaning the summary is free of those error types but contains Discourse Errors that pull Factuality down to 1/3. (A sketch of this understanding is given after this list.)

  2. What I don't understand is how the 'Flip' scores are determined. At first I assumed Flip = 1 - ErrorType (e.g., 1 - Discourse Errors for Flip Discourse Errors), but I couldn't find any consistent way to derive those scores from the labels. I also looked for the code that generates human_annotations.json, but nothing explicitly shows how the Flip scores are computed from the original ones. I understand the motivation for using Flip scores in the ablation study, but I'm not sure how they are generated.
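
To make question 1 concrete, here is a minimal sketch of what I believe the aggregation does. The category groupings (PredE/EntE/CircE, CorefE/LinkE, OutE/GramE) and the single-label-per-sentence simplification are my own assumptions based on the paper, not code taken from this repository:

```python
# Minimal sketch of my understanding of how summary-level scores in
# human_annotations.json could be aggregated from sentence-level labels.
# The label names, category grouping, and single label per sentence are
# my assumptions based on the paper, not the repository's actual code.

SEMANTIC_FRAME = {"PredE", "EntE", "CircE"}
DISCOURSE = {"CorefE", "LinkE"}
CONTENT_VERIFIABILITY = {"OutE", "GramE"}

def aggregate(sentence_labels):
    """sentence_labels: one major error label per summary sentence."""
    n = len(sentence_labels)

    def frac_free(category):
        # Fraction of sentences that do NOT carry an error from `category`.
        return sum(1 for lab in sentence_labels if lab not in category) / n

    return {
        "Factuality": sum(1 for lab in sentence_labels if lab == "NoE") / n,
        "Semantic_Frame_Errors": frac_free(SEMANTIC_FRAME),
        "Discourse_Errors": frac_free(DISCOURSE),
        "Content_Verifiability_Errors": frac_free(CONTENT_VERIFIABILITY),
    }

# One NoE sentence and two LinkE sentences -> the example from question 1:
scores = aggregate(["NoE", "LinkE", "LinkE"])
print(scores)
# Factuality 0.333, Discourse_Errors 0.333, the other two categories 1.0

# For question 2, the naive "flip" I first tried (which does not seem to
# reproduce the released Flip values) would simply be 1 - score:
flip_discourse = 1 - scores["Discourse_Errors"]
```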

Thanks again for the great piece of work. If you could kindly explain this, it would be of great help. =]