Closed: cs329yangzhong closed this issue 5 months ago
Hi, the results in the paper are based on the annotations of the initial five models. Are you including the rows for the added model annotations?
Yes, I excluded `model_extra`, and the test size is 444, matching the paper.
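For concreteness, this is roughly the filtering I do (just a sketch; the file path and column names such as `model` and `doc_id` are my guesses, not the repo's actual schema):

```python
import pandas as pd

# Raw annotation file; the path and column names are placeholders for illustration.
df_test = pd.read_csv("factual_consistency/test_annotations.csv")

# Keep only the initial five models, i.e. drop the later-added model_extra rows.
df_test = df_test[df_test.model != "model_extra"]

# Number of (document, model) summary pairs left; should come out to 444, matching the paper.
print(df_test.groupby(["doc_id", "model"]).ngroups)
```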
Thanks for letting me know. I'll get back to you on this over the weekend!
Hi, the distributions at both the sentence and summary level look good to me! I get the same results as you!
For the balanced accuracy results, it's usually hard to replicate the exact same numbers, but I think your numbers are close enough to those in the paper. The only big difference is AlignScore on summary-level MediaSum main topics, which is probably due to the accumulation of errors at the sentence level. If you have a systematic way to compare all models using the same code skeleton for evaluation in your future work, this should be less of a concern (the relative performance/ranking of models would not change).
Thank you for confirming the distributions. Regarding the big discrepancy on summary-level MediaSum main topics, could you share the aggregated summary-level labels for verification?
Sure! Can you show me your processed summary-level labels?
I combined all datasets here.
Oh, I see. I saw your label distribution above. That was what I asked for.
Here are the stats, which are the same as yours.
MediaSum: Main: Yes: 89; No: 59 Marginal: Yes: 37; No: 37
mediasum: Main: Yes: 114; No: 44 Marginal: Yes: 24; No: 40
Thanks! Then the issue might be related to the AlignScore inference (I used the AlignScore large checkpoint) or to the way we constructed the full summary (I concatenate the summary sentences with "\n").
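For reference, here is the gist of how I run AlignScore on the full summaries (a sketch only; the checkpoint path is a placeholder and the constructor arguments are from my memory of the AlignScore README, so please double-check them against the released code):

```python
from alignscore import AlignScore

# Source document and summary sentences (placeholders).
source_document = "..."
summary_sents = ["First summary sentence.", "Second summary sentence."]

# Build the full summary by joining its sentences with "\n", as mentioned above.
full_summary = "\n".join(summary_sents)

# AlignScore-large checkpoint; the path is a placeholder.
scorer = AlignScore(
    model="roberta-large",
    batch_size=32,
    device="cuda:0",
    ckpt_path="/path/to/AlignScore-large.ckpt",
    evaluation_mode="nli_sp",
)

# Score the full summary against its source document.
score = scorer.score(contexts=[source_document], claims=[full_summary])[0]
print(score)
```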
I need some help replicating the results in the paper. I wonder if you plan to release aggregated files for summary-level labels so I can do a cross-check.
Summary-Level
Below are my summary-level Balanced Accuracy results using AlignScore (their released large checkpoint).
I tuned the threshold on the dev set: `df_val_temp = df_val[(df_val.dataset == origin) & (df_val.topic_split == topic)]`
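Concretely, the tuning amounts to something like the grid search below (a sketch; it reuses `df_val`, `df_test`, `origin`, and `topic` from the filter above, and the `alignscore`/`label` column names are my own):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Dev/test rows for one (dataset, topic_split) pair; label: 1 = consistent ("Yes"), 0 = not.
df_val_temp = df_val[(df_val.dataset == origin) & (df_val.topic_split == topic)]
df_test_temp = df_test[(df_test.dataset == origin) & (df_test.topic_split == topic)]

# Grid-search the decision threshold that maximizes balanced accuracy on the dev slice.
thresholds = np.linspace(0.0, 1.0, 101)
best_thr = max(
    thresholds,
    key=lambda t: balanced_accuracy_score(
        df_val_temp.label, (df_val_temp.alignscore >= t).astype(int)
    ),
)

# Apply the dev-tuned threshold to the matching test slice.
test_bacc = balanced_accuracy_score(
    df_test_temp.label, (df_test_temp.alignscore >= best_thr).astype(int)
)
print(f"threshold={best_thr:.2f}, test balanced accuracy={test_bacc:.3f}")
```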
The original paper has:
Since I derived the summary-level labels from your raw files, I wonder if the authors could help check the converted label distribution.
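For transparency, my conversion is essentially "a summary counts as consistent only if every one of its sentences is consistent". A minimal sketch of it, with assumed file and column names:

```python
import pandas as pd

# Sentence-level annotations; file name and columns (doc_id, model, topic_split, label)
# are assumptions about the released schema, not the exact field names.
sent_df = pd.read_csv("factual_consistency/mediasum_test.csv")

# My conversion rule: a summary is "Yes" (consistent) only if all of its sentences are "Yes".
summ_df = sent_df.groupby(["doc_id", "model", "topic_split"], as_index=False).agg(
    summary_label=("label", lambda s: "Yes" if (s == "Yes").all() else "No")
)

# Summary-level label distribution per topic split, to compare with the counts above.
print(summ_df.groupby("topic_split").summary_label.value_counts())
```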
Sentence-Level
There are several typos in the provided files under the `factual_consistency` folder; see Issue #4. After fixing them (I changed the `sent_idx` for the first two typos and removed the duplicated row with the "yes" label for the last), my scores are lower than those reported in the paper. Could the authors help verify whether the results in the paper can be easily reproduced? The paper has:
Thanks!