amazon-science / tofueval


Question on reproducing AlignScore results #3

Closed cs329yangzhong closed 5 months ago

cs329yangzhong commented 6 months ago

I need some help replicating the results in the paper. I wonder if you plan to release aggregated files for summary-level labels so I can do a cross-check.

Summary-Level

Below are my results on summary-level balanced accuracy using AlignScore (their released large checkpoint).

I tuned the threshold on the dev set, filtering per dataset and topic split: `df_val_temp = df_val[(df_val.dataset == origin) & (df_val.topic_split == topic)]`.
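Concretely, my threshold search looks roughly like the sketch below. The `alignscore` score column and the binary `label` column (1 = consistent) are my own names for my processed dev file, not anything from the released data.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def tune_threshold(df_split, score_col="alignscore", label_col="label"):
    """Pick the threshold that maximizes balanced accuracy on this dev split."""
    best_t, best_ba = 0.5, -1.0
    for t in np.arange(0.0, 1.01, 0.01):
        preds = (df_split[score_col] >= t).astype(int)
        ba = balanced_accuracy_score(df_split[label_col], preds)
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t

# one threshold per (dataset, topic split), matching the filter above
thresholds = {}
for origin in df_val.dataset.unique():
    for topic in df_val.topic_split.unique():
        df_val_temp = df_val[(df_val.dataset == origin) & (df_val.topic_split == topic)]
        thresholds[(origin, topic)] = tune_threshold(df_val_temp)
```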

[screenshot: my summary-level balanced accuracy results with the AlignScore-large checkpoint]

The original paper has

[screenshot: summary-level balanced accuracy results reported in the paper]

Since I derived the summary-level labels from your raw files, I wonder if the authors could help check the converted label distribution.
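For reference, my aggregation is roughly the sketch below. The grouping keys and the `sent_label` column are my own naming for the raw sentence-level file; I mark a summary "yes" only when every one of its sentences is labeled "yes".

```python
import pandas as pd

# df: the raw sentence-level annotation file loaded with pandas
# ("yes" = factually consistent sentence; column names are my own)
summary_labels = (
    df.groupby(["dataset", "doc_id", "model", "topic"])["sent_label"]
      .apply(lambda labels: "yes" if (labels == "yes").all() else "no")
      .reset_index(name="summary_label")
)
print(summary_labels["summary_label"].value_counts())
```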

[screenshots: my converted summary-level label distributions for the two datasets]

Sentence-Level

There are several typos in the provided files under the `factual_consistency` folder (see Issue #4). After fixing them (I changed the `sent_idx` for the first two typos and removed the duplicated row with the "yes" label for the last), my scores are lower than those reported in the paper. Could the authors help verify whether the paper's results can be easily reproduced?
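For the duplicated row specifically, my fix is roughly the following (again with my own column names for the raw file):

```python
# drop the duplicated annotation row that carries the "yes" label,
# keeping the "no" row for that summary sentence
dup_keys = ["dataset", "doc_id", "model", "topic", "sent_idx"]
is_dup = df.duplicated(subset=dup_keys, keep=False)
df = df[~(is_dup & (df["sent_label"] == "yes"))]
```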

[screenshot: my sentence-level balanced accuracy results after the fixes]

The paper has

[screenshot: sentence-level balanced accuracy results reported in the paper]

Thanks!

Liyan06 commented 6 months ago

Hi, the results in the paper are based on the annotations of the initial five models. Are you including the rows from the added model's annotations?

cs329yangzhong commented 6 months ago

Yes, I excluded `model_extra`, and the test size is 444, matching the paper.

[screenshot: test split after excluding model_extra]
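The filter itself is just the following (assuming the added annotations are tagged `model_extra` in the model column of my processed file):

```python
# drop the rows from the later-added model before evaluation
df_test = df_test[df_test["model"] != "model_extra"]
assert len(df_test) == 444  # matches the test size in the paper
```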

Liyan06 commented 6 months ago

Thanks for letting me know. I'll get back to you on this over the weekend!

Liyan06 commented 6 months ago

Hi, the distributions at both the sentence and summary level look good to me! I get the same results as you!

For the balanced accuracy results, it's usually hard to replicate the exact same numbers, but I think your numbers are close enough to those in the paper. The only big difference is AlignScore on summary-level MediaSum main topics, which is probably due to the accumulation of errors at the sentence level. If you compare all models systematically with the same evaluation code skeleton in your future work, this should be less of a concern (the relative performance/ranking of models would not change).

cs329yangzhong commented 6 months ago

Thank you for confirming the distributions. Regarding the big discrepancy in the summary-level MediaSum main-topic evaluation, could you share the aggregated summary-level labels for verification?

Liyan06 commented 6 months ago

Sure! Can you show me your processed summary-level labels?

cs329yangzhong commented 6 months ago

I combined all datasets here

Liyan06 commented 6 months ago

Oh, I see. Your label distribution above was what I asked for.

Here are the stats, which are the same as yours.

MediaSum: Main topics: Yes 89, No 59; Marginal topics: Yes 37, No 37

MeetingBank: Main topics: Yes 114, No 44; Marginal topics: Yes 24, No 40

cs329yangzhong commented 6 months ago

Thanks! Then the issue might be related to the AlignScore inference itself (I used the AlignScore-large checkpoint) and to the way we constructed the full summary (I concatenated the summary sentences with "\n").
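For reference, my inference setup roughly follows the usage pattern from the AlignScore README; the checkpoint path and the `source_document` / `summary_sentences` variables are placeholders for my local data.

```python
from alignscore import AlignScore

# AlignScore-large checkpoint; the path is a placeholder for my local copy
scorer = AlignScore(
    model="roberta-large",
    batch_size=32,
    device="cuda:0",
    ckpt_path="AlignScore-large.ckpt",
    evaluation_mode="nli_sp",
)

# full summary built by joining the summary sentences with "\n"
summary = "\n".join(summary_sentences)
score = scorer.score(contexts=[source_document], claims=[summary])[0]
```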