Shikib / fed

Code for SIGdial 2020 paper: Unsupervised Evaluation of Interactive Dialog with DialoGPT (https://arxiv.org/abs/2006.12719)

How to Reproduce the Results on the FED Dataset? #2

Closed Jianqiao-Zhao closed 2 years ago

Jianqiao-Zhao commented 2 years ago

Hi,

What is the right way to reproduce the FED evaluation method on the FED dataset?

I downloaded the FED dataset from the author's personal website and only used the dialogues with dialogue-level annotations, which gives 125 dialogues in total (not the 124 mentioned in your original paper and other papers).

For each dialogue, I calculated a FED score following the provided demo, using the "microsoft/DialoGPT-large" model. The average of the 10 dialogue-level evaluation perspectives is taken as the overall score; these perspectives are coherent, error recovery, consistent, diverse, topic depth, likeable, understanding, flexible, informative, and inquisitive. The human evaluation score is the average of the 5 "Overall" annotations.
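Concretely, my scoring loop looks roughly like the sketch below. It assumes the `fed.load_models` / `fed.evaluate` interface shown in this repo's README; the exact spelling of the quality keys returned by `fed.evaluate` is my guess and may need adjusting against `fed.py`.

```python
# A minimal sketch of my per-dialogue scoring, assuming the
# fed.load_models / fed.evaluate interface from this repo's README.
# The entries of DIALOG_QUALITIES are the ten perspectives I average;
# the exact keys returned by fed.evaluate may be spelled differently.
import fed

model, tokenizer = fed.load_models("microsoft/DialoGPT-large")

DIALOG_QUALITIES = [
    "coherent", "error recovery", "consistent", "diverse", "topic depth",
    "likeable", "understanding", "flexible", "informative", "inquisitive",
]

def fed_overall(utterances):
    """Average the ten dialogue-level FED qualities for one dialogue.

    `utterances` is the list of turns (speaker prefixes already stripped,
    which is the variant that gave me 0.126).
    """
    conversation = " ".join("<|endoftext|> " + u.strip() for u in utterances)
    scores = fed.evaluate(conversation, model, tokenizer)
    return sum(scores[q] for q in DIALOG_QUALITIES) / len(DIALOG_QUALITIES)
```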

This process ends up with a poor Spearman correlation of only 0.126. If the speaker prefixes "User: " and "System: " are preserved in the utterances, the Spearman correlation jumps to 0.251. Both results are far from the 0.443 reported in the original paper.
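The correlation itself is computed with `scipy.stats.spearmanr` between the per-dialogue FED score and the mean of the five human "Overall" annotations, roughly as below. The field names ("context", "annotations", "Overall") and the newline-separated "User: ..." / "System: ..." turns reflect how I read the released `fed_data.json`; they may not match your exact release.

```python
# Spearman correlation between my FED scores and the mean human
# "Overall" annotations (field names are my reading of fed_data.json).
import json
from scipy.stats import spearmanr

with open("fed_data.json") as f:
    data = json.load(f)

fed_scores, human_scores = [], []
for item in data:
    annotations = item.get("annotations", {})
    if "Overall" not in annotations:
        continue  # keep only the dialogue-level items
    turns = item["context"].split("\n")
    # Strip the "User: " / "System: " prefixes (keeping them gave me 0.251).
    turns = [t.split(": ", 1)[-1] for t in turns]
    fed_scores.append(fed_overall(turns))
    human_scores.append(sum(annotations["Overall"]) / len(annotations["Overall"]))

rho, p_value = spearmanr(fed_scores, human_scores)
print(f"{len(fed_scores)} dialogues, Spearman rho = {rho:.3f}")
```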

I think there may be some differences in the data preprocessing. I would really appreciate it if you could point them out. Thank you!

oceanos74 commented 2 years ago

I could not reproduce the result, either.

Shikib commented 2 years ago

The response to #3 may be helpful to you.

Thank you for raising this issue, and apologies for the difficulties. I think the difference in performance stems from (1) post-processing of the dataset (Section 3.4 in the paper), which was used to remove noise from the scores, and (2) the fact that a different version of the DialoGPT models/scripts was used to obtain the results in the paper. As mentioned in my answer to #3, there are two things you can try:

  1. If you follow the post-processing described below (a small sketch of the rule follows this list), the results might better match the performance in the paper. Note that future work has typically used the FED data without this post-processing, so it is also reasonable to use the dataset as is and compare to the FED performance reported here (https://arxiv.org/pdf/2106.03706.pdf).

"Given that each of the 4712 data points was labeled by five annotators, post-processing was used to improve the quality of the data through the removal of outliers. Given five annotations for a given question, the furthest label from the mean is removed if its distance from the mean is greater than half the standard deviation of the five annotations."

  2. My suggestion is to rely on the modified FED script in this repo: https://github.com/exe1023/DialEvalMetrics, which has its performance reported here: https://arxiv.org/pdf/2106.03706.pdf (this is on the version of the dataset that is not post-processed).
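For clarity, here is a small sketch of the outlier-removal rule quoted in (1). This is a paraphrase of the description from the paper, not the original post-processing script.

```python
# Sketch of the outlier-removal rule quoted in (1) above; a paraphrase
# of the paper's description, not the original post-processing script.
import numpy as np

def filter_annotations(labels):
    """Given the five annotations for one question, drop the label
    furthest from the mean if its distance from the mean exceeds half
    the standard deviation of the five labels; otherwise keep all five."""
    labels = np.asarray(labels, dtype=float)
    distances = np.abs(labels - labels.mean())
    worst = int(distances.argmax())
    if distances[worst] > 0.5 * labels.std():
        labels = np.delete(labels, worst)
    return labels.tolist()
```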

I believe that if you use the right pre-processing and the right post-processing of the dataset, the results will match the paper. Apologies that this is not straightforward -- I have not re-tested the FED metric since the original DialoGPT scripts were deprecated. If you continue having trouble, I am happy to look into this further.