Shikib / fed

Code for SIGdial 2020 paper: Unsupervised Evaluation of Interactive Dialog with DialoGPT (https://arxiv.org/abs/2006.12719)

I can't reproduce the same results in the paper #3

Open DSincerity opened 2 years ago

DSincerity commented 2 years ago

Hi, like #2, I also tried to reproduce the FED paper results with the FED data (http://shikib.com/fed_data.json), but I couldn't obtain the same results as the paper.

1) Average scores of annotators. By applying the data processing method described in the paper, I could only reproduce similar results for the dialog-level evaluation, not the turn-level evaluation. How can I reproduce the turn-level results?

2) Correlation between follow-up utterance (FU) scores and average annotator scores. I also calculated the correlation between FU scores and the average human evaluation scores. I obtained the FU scores with the DialoGPT (large) model, following the guidance in the README (i.e., preprocessing the inputs and using the FED module). However, the resulting correlations were totally different from the paper. I wonder if the FU scores in the paper were calculated in the same way as in this repository. How can I reproduce the same correlation results?
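For reference, this is roughly how I computed the FU scores; the call signatures below are my reading of the README, so treat it as a sketch rather than my exact script:

```python
import fed  # fed.py from this repository

# Load DialoGPT-large through the FED module (as I understood the README).
model, tokenizer = fed.load_models("microsoft/DialoGPT-large")

# Turns joined with the DialoGPT end-of-text separator, per the README example.
conversation = "<|endoftext|> Hi! <|endoftext|> Hello, how is your day?"

# Returns a dict mapping each quality to its follow-up-utterance score.
scores = fed.evaluate(conversation, model, tokenizer)
print(scores)
```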

Shikib commented 2 years ago

Thank you for raising this issue, and apologies for the difficulties. I am not sure if I still have all of the original scripts available, but if the suggestions below do not fix your problem, I will try to look for them.

  1. Average score of annotators: I think the problem here most likely stems from the fact that the released data is WITHOUT OUR POST-PROCESSING (Section 3.4 in the paper). If you apply the post-processing described below, the results might better match the numbers in the paper. Note that I think future work has used the FED data without the post-processing, so it's also reasonable to use the dataset as is and compare to the FED performance reported here (https://arxiv.org/pdf/2106.03706.pdf).

Given that each of the 4712 data points was labeled by five annotators, post-processing was used to improve the quality of the data through the removal of outliers. Given five annotations for a given question, the furthest label from the mean is removed if its distance from the mean is greater than half the standard deviation of the five annotations.
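In case it helps, here is a minimal sketch of that rule (my own paraphrase, not the original post-processing script):

```python
import numpy as np

def drop_outlier(labels):
    """Given the five annotator scores for one question, remove the label
    furthest from the mean if its distance from the mean exceeds half the
    standard deviation of the five annotations."""
    labels = np.asarray(labels, dtype=float)
    distances = np.abs(labels - labels.mean())
    furthest = distances.argmax()
    if distances[furthest] > 0.5 * labels.std():
        labels = np.delete(labels, furthest)
    return labels.tolist()

# e.g. drop_outlier([2, 2, 2, 2, 0]) -> [2.0, 2.0, 2.0, 2.0]
```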

  2. No, the FED results in the paper were not calculated in the same way as in this repository. This is an unfortunate consequence of the lack of backwards compatibility in HuggingFace/DialoGPT. The results in the paper were obtained with the original DialoGPT repo, without HuggingFace; however, that approach has since been deprecated. There are slight differences that require the dataset to be post-processed in a very specific way.

My suggestion is to rely on the modified FED script in this repo: https://github.com/exe1023/DialEvalMetrics, which has its performance reported here: https://arxiv.org/pdf/2106.03706.pdf.

I believe that if you use the right pre-processing and the right post-processing of the dataset, the results will match the paper. Apologies that this is not straightforward -- I have not re-tested the FED metric since the original approach was deprecated. If you continue having trouble, I am happy to look into this further.

DSincerity commented 2 years ago

@Shikib Thank you for the quick reply and the detailed answers to my questions.

  1. As suggested in the paper, I applied the post-processing method (removing outliers among the five annotations for each question) to the FED data, and I obtained results similar to the paper only for the dialog-level data points. Given these results, I believe I applied the post-processing correctly, but I don't know why the results differ for the turn-level data points. Please take a closer look at the results I attached (average scores in the FED data).

  2. As you answered, it seems the results were not reproduced because the model used in the paper is different from the model in this repository. However, even if the models differ, I am still curious why the correlation with human annotations is so different for the same evaluation dataset.

I will retry reproducing the results as you suggested, based on the paper (https://arxiv.org/pdf/2106.03706.pdf) and the repository (https://github.com/exe1023/DialEvalMetrics).

Thanks again for answering my questions. :)

Shikib commented 2 years ago

I'm able to reproduce your issues with the average scores on my end. I don't have time right now to dig into why the released data is producing different results for (1) -- if this answer doesn't help you, I will dig deeper on Friday. However, I was able to find my original code, which produces the results in the paper. Hopefully this is sufficient for your needs -- let me know if it is not and I will try to help more.

Original analysis code: https://pastebin.com/kF7304h8 (apologies, this is super hacky/messy code -- I was never intending to release it, but in hindsight that was a mistake on my part).

Output (without the metric correlation): https://pastebin.com/2PK6BnpL

Necessary data files: https://drive.google.com/file/d/1xAcXrHMEmJRxyt0iLTSnzpnIJ0163Ltj/view?usp=sharing

Hopefully that helps. I'm hoping there are no inconsistencies between the released data and the original data I shared above.

Shikib commented 2 years ago

As for the second part of your question, the unfortunate answer is that the FED metric is extremely sensitive/brittle. Small differences in pre-processing, model initialization, calculation of the LM likelihood, etc. will drastically reduce performance. The results reported in that paper (1) do not use the post-processed version of the FED dataset (this makes the data slightly noisier, but it was a necessary choice to maintain consistency with other work that used the FED dataset). Also, (2) I'm not sure exactly how the Microsoft DialoGPT repo got included in HuggingFace, but I'm guessing that microsoft/DialoGPT-medium != 762M-ft. Strangely, the results look much more similar to 762M-fs (column 3 in our paper).

DSincerity commented 2 years ago

You mentioned that you were able to reproduce the same average scores. How did you do that? I tried to reproduce the results based on the original code (https://pastebin.com/kF7304h8) and the necessary data files (https://drive.google.com/file/d/1xAcXrHMEmJRxyt0iLTSnzpnIJ0163Ltj/view?usp=sharing) that you shared, but I got outputs different from what you shared, which are also totally different from those in the paper. Moreover, the original data files seem to differ from 'fed_data.json' (http://shikib.com/fed_data.json). Please check the data :) Thank you

DSincerity commented 2 years ago

Also, in the original code, it seems that the logic for removing outliers is different from the paper. The paper uses half of the standard deviation of the annotations as the threshold, but your code uses a full standard deviation. Please check the code.

Shikib commented 2 years ago

The output you pasted is actually the same as the paper; it's just that the data was collected on a different scale (0-2 instead of 1-3). You need to add 1 to every quality except consistent/understandable, so Meena-interesting = 1.58 + 1 = 2.58 (same as the paper).
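In code, the conversion is just the following (the quality names here are illustrative; match them to the keys in your copy of the data):

```python
# Qualities collected on a 0-2 scale are shifted by +1 to match the 1-3 scale
# used in the paper; consistent/understandable are left unshifted (see above).
UNSHIFTED = {"Consistent", "Understandable"}

def to_paper_scale(quality, avg_score):
    return avg_score if quality in UNSHIFTED else avg_score + 1.0

print(to_paper_scale("Interesting", 1.58))  # 2.58, the Meena-interesting value in the paper
```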

Can you elaborate on how the data files look different? If you have evidence of this, it would be valuable to share. The data files were cleaned up and joined together to produce the released fed_data.json.

Thank you for identifying the discrepancy between the paper and the code. That is a mistake and I will modify the arXiv version of the paper.

DSincerity commented 2 years ago

@Shikib I am sorry, I made a mistake in reporting my results. Based on your code and data files, I simply ran the Python file (analyze.py), but what I got is different from what you shared. I have revised my previous comment; please read it again. Thank you.

Shikib commented 2 years ago

From what I understand, you're running exactly the code that I provided with the exact data that I provided, but getting different results? That is very strange. I have made a clean copy of all the data files that I shared plus my cleaned-up analyze script, and I'm able to reproduce the results in the paper.

I downloaded the directory from here (https://drive.google.com/file/d/1-pMlmO6s0qlr5UpDf0_WKyjawck_bHv6/view?usp=sharing), ran the analyze script inside it, and it reproduces the results from the paper.

Unless I'm misunderstanding, I am not able to reproduce the issue you are describing now.

DSincerity commented 2 years ago

It's very strange. I downloaded the data files again (https://drive.google.com/file/d/1-pMlmO6s0qlr5UpDf0_WKyjawck_bHv6/view?usp=sharing) and ran the analyze script inside the directory just a while ago, but the result is the same as before.

[images: analyze.py output]

DSincerity commented 2 years ago

@Shikib To double-check whether I made a mistake in reproducing the results based on your guidance and the shared code and files (https://drive.google.com/file/d/1-pMlmO6s0qlr5UpDf0_WKyjawck_bHv6/view?usp=sharing), I also asked a colleague to reproduce the results from what you shared. However, he got the same results as mine, which are different from yours and from those in the paper. Please check the code and files again.

Shikib commented 2 years ago

Hi, I have just managed to reproduce your issue. Doing so allowed me to identify a bug in the code that invalidates the numbers reported in the paper (only the average score per system). The error comes from this line: `system_scores[system_map[e]].append(mean_scores[i])`, which should actually be `system_scores[system_map[e]].append(np.mean(mean_scores[3*i:3*(i+1)]))`.

Correcting this gives numbers similar to what you initially found in your first comment (i.e., the analyze.py script that I sent you has a bug). I apologize for this, and thank you for helping identify the issue. We will make a revision to the arXiv version of the paper.

DSincerity commented 2 years ago

@Shikib Thank you for following up on my issues. According to your comment, I changed the code (line 128 of analyze.py) to what you mentioned (`system_scores[system_map[e]].append(np.mean(mean_scores[3*i:3*(i+1)]))`). However, I still could not obtain the same results reported in the paper. Please check whether my revision is right and what results you obtained.

Shikib commented 2 years ago

You misunderstood my comment. I am saying that there is a bug in my analyze.py code that incorrectly produced the average turn-level system scores in the paper.

The original line (`system_scores[system_map[e]].append(mean_scores[i])`) results in some scores being randomly ignored (different for each system, depending on the order of `conv_map.keys()`). After fixing the bug to use the `[3*i:3*(i+1)]` slice, the result is the same as the straightforward analysis of fed_data.json (i.e., what you originally posted in the image below). The results in the image below are correct; the paper is wrong, and I will update it.
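Schematically, the corrected aggregation looks like this (a paraphrase of the relevant part of analyze.py, with the data structures assumed rather than copied):

```python
import numpy as np
from collections import defaultdict

def aggregate_system_scores(conv_map, system_map, mean_scores):
    """Paraphrase of the fixed aggregation. mean_scores holds one
    post-processed average per annotated turn, with three consecutive
    entries per conversation (hence the 3*i slice); conv_map/system_map
    map conversation ids to their dialogs and system names."""
    system_scores = defaultdict(list)
    for i, e in enumerate(conv_map.keys()):
        # Buggy original: append(mean_scores[i]) -- drops most turn-level scores.
        system_scores[system_map[e]].append(np.mean(mean_scores[3*i:3*(i+1)]))
    return {system: np.mean(scores) for system, scores in system_scores.items()}
```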

Sorry about this. You should still be able to reproduce the FED results used in later papers (https://arxiv.org/pdf/2106.03706.pdf, https://drive.google.com/file/d/1oVDMg-6HffMpkSXWylafTvVI1pDuMuqw/view, and others which were done with this repo).

[image: the results referenced above]