Closed: Jianqiao-Zhao closed this issue 2 years ago
I could not reproduce the result, either.
The response to #3 may be helpful to you.
Thank you for raising this issue, and apologies for the difficulties. I think the difference in performance stems from (1) the post-processing of the dataset (Section 3.4 in the paper), which was used to remove noise from the scores, and (2) the fact that a different version of the DialoGPT models/scripts was used to obtain the results in the paper. As mentioned in my answer to #3, there are two things you can try:
"Given that each of the 4712 data points was labeled by five annotators, post-processing was used to improve the quality of the data through the removal of outliers. Given five annotations for a given question, the furthest label from the mean is removed if its distance from the mean is greater than half the standard deviation of the five annotations."
I believe that if you use the right pre-processing and the right post-processing of the dataset, the results will match the paper. Apologies that this is not straightforward -- I have not re-tested the FED metric since the original DialoGPT scripts were deprecated. If you continue having trouble, I am happy to look into this further.
Hi,
What is the right way to reproduce the FED evaluation method on the FED dataset?
I downloaded the FED dataset from the author's personal website and used only the dialogues with dialogue-level annotations, which results in 125 dialogues in total (not the 124 mentioned in your original paper and other papers).
For each dialogue, I calculated a FED score following the provided demo, using the "microsoft/DialoGPT-large" model. The overall score is the average over the 10 dialogue-level evaluation perspectives: coherent, error recovery, consistent, diverse, topic depth, likeable, understanding, flexible, informative, and inquisitive. The human evaluation score was obtained by averaging the five "Overall" annotations.
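For concreteness, here is a rough, self-contained sketch of the follow-up-likelihood idea the FED metric builds on, rather than the demo code itself; the follow-up utterance, the context formatting, and the speaker-prefix handling are all illustrative assumptions:

```python
# Sketch of the core FED mechanism: score a dialogue by the likelihood DialoGPT
# assigns to a hand-written follow-up utterance. The follow-up below is
# illustrative, not one of the exact prompts used in the paper or demo.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()

def followup_log_likelihood(context_utterances, followup, keep_speaker_tags=False):
    """Average log-likelihood of `followup` given the flattened dialogue context."""
    if not keep_speaker_tags:
        # e.g. strip "User: " / "System: " prefixes before scoring
        context_utterances = [u.split(": ", 1)[-1] for u in context_utterances]
    context = tokenizer.eos_token.join(context_utterances) + tokenizer.eos_token
    ctx_ids = tokenizer.encode(context, return_tensors="pt")
    fol_ids = tokenizer.encode(followup + tokenizer.eos_token, return_tensors="pt")
    input_ids = torch.cat([ctx_ids, fol_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[-1]] = -100  # only score the follow-up tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over follow-up tokens
    return -loss.item()

dialog = ["User: Hi!", "System: Hello, how can I help?", "User: Tell me a joke."]
print(followup_log_likelihood(dialog, "You really don't make any sense.",
                              keep_speaker_tags=True))
```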
This process yields a poor Spearman correlation of only 0.126. If the speaker prefixes "User: " and "System: " are preserved in the utterances, the Spearman correlation jumps to 0.251. Both results are far from the 0.443 reported in the original paper.
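The aggregation and correlation were computed along these lines (a sketch only; the container names and layout are hypothetical, and loading the FED JSON into them is not shown):

```python
# Sketch of the aggregation and Spearman computation described above.
import numpy as np
from scipy.stats import spearmanr

DIALOG_QUALITIES = [
    "coherent", "error recovery", "consistent", "diverse", "topic depth",
    "likeable", "understanding", "flexible", "informative", "inquisitive",
]

def dialogue_level_correlation(fed_scores, human_overall):
    """fed_scores: one dict per dialogue mapping each quality to its FED score.
    human_overall: one list per dialogue with the five "Overall" annotations."""
    predicted = [np.mean([s[q] for q in DIALOG_QUALITIES]) for s in fed_scores]
    human = [np.mean(anns) for anns in human_overall]
    rho, p = spearmanr(predicted, human)
    return rho, p
```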
I suspect there are some differences in the data preprocessing. I would really appreciate it if you could point them out. Thank you!