THU-KEG / KoLA

[ICLR24] The open-source repo of THU-KEG's KoLA benchmark.
https://arxiv.org/abs/2306.09296

Ambiguity in the evaluation metrics #11

Open · zhimin-z opened this issue 8 months ago

zhimin-z commented 8 months ago

[image]

Are you evaluating F1 or EM (ROUGE or BLEU) for these datasets? I could not tell from reading the paper. Also, BLEU has many variants; which variant do you use in the implementation?

xurh20 commented 5 months ago

  • No, if a task has more than one evaluation metric listed here, they are in an '&' relationship; more specifically, for the KM and KC tasks we evaluate with both metrics.
  • The BLEU evaluation we use here is:

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
    
    # Corpus-level BLEU-4: uniform n-gram weights with NLTK's smoothing method 3.
    bleu = corpus_bleu(self.refs, self.hyps, weights=(0.25, 0.25, 0.25, 0.25),
                       smoothing_function=SmoothingFunction().method3)

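For readers unfamiliar with the call above: corpus_bleu expects tokenized input, with hypotheses as a list of token lists and references as a list of lists of token lists (one list of candidate references per hypothesis). A minimal runnable sketch under that assumption; the sample sentences are illustrative, not KoLA data:

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
    
    # One hypothesis paired with one reference; both are pre-tokenized.
    hyps = [["the", "cat", "sat", "on", "the", "mat"]]
    refs = [[["the", "cat", "is", "on", "the", "mat"]]]
    
    # Same configuration as the snippet quoted above: BLEU-4, smoothing method 3.
    bleu = corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25),
                       smoothing_function=SmoothingFunction().method3)
    print(bleu)

With method 3 smoothing, hypotheses with zero higher-order n-gram matches no longer zero out the corpus score.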

zhimin-z commented 5 months ago

Thanks for your clarification. But still, which metric is ultimately used for the leaderboard?

xurh20 commented 5 months ago

> which metric is ultimately used for the leaderboard?

Could you please be more specific? Are you asking how the total points are calculated?

zhimin-z commented 5 months ago

> Could you please be more specific? Are you asking how the total points are calculated?

  1. Is BLEU or ROUGE used for the leaderboard?
  2. Is EM or F1 used for the leaderboard?
xurh20 commented 5 months ago

  1. ROUGE is used for evaluating the KC result; to be precise, we use rouge-l_f.
  2. The metric shown in bold in the picture above is the one used for the final leaderboard. Please refer to Section 2.3, Contrastive Evaluation System, in the paper for our evaluation method; it is laid out clearly there.
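For anyone reproducing the rouge-l_f score from item 1 above, here is a minimal sketch assuming the commonly used rouge PyPI package; the repo's actual implementation may differ, and the sample strings are illustrative:

    from rouge import Rouge
    
    # Illustrative strings, not KoLA data.
    hyps = ["the cat sat on the mat"]
    refs = ["the cat is on the mat"]
    
    # get_scores returns precision/recall/F for ROUGE-1, ROUGE-2, and ROUGE-L;
    # the F-measure of ROUGE-L is the rouge-l_f referred to above.
    scores = Rouge().get_scores(hyps, refs, avg=True)
    rouge_l_f = scores["rouge-l"]["f"]
    print(rouge_l_f)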

zhimin-z commented 5 months ago

> 1. ROUGE is used for evaluating the KC result; to be precise, we use rouge-l_f.
> 2. The metric shown in bold in the picture above is the one used for the final leaderboard. Please refer to Section 2.3, Contrastive Evaluation System, in the paper for our evaluation method; it is laid out clearly there.

Thank you for your response; that gives me exactly the clarity I was looking for. In particular, the explanation that rouge-l_f is used for evaluating the KC result, and that the bolded metrics are the ones used in the final leaderboard, answers my question.

Given the importance of these details, I would suggest adding them to the official documentation. That would make the evaluation system much easier to understand for all readers and users, especially the process outlined in Section 2.3.

Thanks again; I believe documenting these details would be a valuable step forward.