Averaging of CE Metrics

ChantalMP commented 1 year ago

Hi,

thanks for sharing your work.

As I understand it, Table 6 in your paper reports example-based metrics (an F1 score computed for every report, then averaged), not micro- or macro-averaged F1. Is that correct?

How did you determine that the other papers you compare against also use example-based F1 rather than micro- or macro-averaged F1?

Any hint would be appreciated.

Thanks in advance! :)

fuying-wang commented 6 months ago

Hi,

Thanks very much for the awesome work. I have the same question. The baseline results in Table 6 appear to be the same as those in the original papers. However, according to the R2Gen code, they seem to use macro- or micro-averaged CE metrics.

anicolson commented 6 months ago

Hi ChantalMP and fuying-wang,

Thank you for pointing this out. Our reported results in Table 6 are indeed averaged over each example (example-based CE metrics). The results for the other methods are taken from their respective papers. We found it difficult to determine how the CE scores were averaged in those papers, as this detail was not included. Because papers prior to R2Gen stated which averaging method they used (macro- or micro-averaging), we assumed that papers such as R2Gen, which do not mention it, were using neither. Instead, we assumed they were averaging over all examples (this may have been a bad assumption).

To alleviate this discrepancy, we made sure to report how we averaged our results. Hopefully, this ambiguity can be avoided in future papers on the topic. Unfortunately, we may have made the mistake of comparing to a different averaging strategy.

If they indeed used micro- or macro-averaging, and not averaging over each example, then the micro- and macro-averaged results for CvT2DistilGPT2 can be found here: https://github.com/aehrc/cvt2distilgpt2?tab=readme-ov-file#results.
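For anyone comparing numbers across papers, here is a minimal sketch of how the three averaging strategies discussed above can diverge on the same predictions. It is not taken from the CvT2DistilGPT2 or R2Gen code: the function names, the toy label matrices, and the convention of scoring F1 as 0.0 when a report or label has no positives at all are assumptions made for illustration. Rows stand for reports and columns for binary labels.

```python
def f1_from_counts(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); returning 0.0 for an empty denominator
    # is an assumed convention, not something stated in the thread.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def confusion(true, pred):
    # Count true positives, false positives and false negatives
    # for one sequence of binary labels.
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return tp, fp, fn


def example_based_f1(y_true, y_pred):
    # One F1 per report (row), then the mean over reports.
    scores = [f1_from_counts(*confusion(t, p)) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


def macro_f1(y_true, y_pred):
    # One F1 per label (column), then the mean over labels.
    columns = zip(zip(*y_true), zip(*y_pred))
    scores = [f1_from_counts(*confusion(t, p)) for t, p in columns]
    return sum(scores) / len(scores)


def micro_f1(y_true, y_pred):
    # Pool the counts over every report and label, then one F1.
    flat_true = [v for row in y_true for v in row]
    flat_pred = [v for row in y_pred for v in row]
    return f1_from_counts(*confusion(flat_true, flat_pred))


# Hypothetical toy data: 4 reports, 3 labels.
y_true = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]

print(f"example-based F1: {example_based_f1(y_true, y_pred):.3f}")  # 0.750
print(f"macro F1:         {macro_f1(y_true, y_pred):.3f}")          # 0.667
print(f"micro F1:         {micro_f1(y_true, y_pred):.3f}")          # 0.727
```

On this toy data the same predictions score 0.750, 0.667, or 0.727 depending only on the averaging, which is why a table mixing strategies is hard to interpret.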

fuying-wang commented 6 months ago

Hi,

Thanks very much for the detailed clarification! I also noticed that R2Gen and other papers don't mention their averaging method, which confused me as well. Apart from this, your code and detailed results are awesome!

aehrc / cvt2distilgpt2

Averaging of CE Metrics #11