zhimin-z opened this issue 12 months ago
No, if a task has more than one evaluation metric listed here, it means an '&' relationship. More specifically, for the KM and KC tasks we evaluate using both metrics.
The BLEU evaluation we used here is:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
bleu = corpus_bleu(self.refs, self.hyps, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=SmoothingFunction().method3)
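For reference, here is a minimal self-contained sketch of how corpus_bleu with method3 smoothing is typically called; the token lists below are made-up examples, not taken from the dataset, and stand in for self.refs and self.hyps:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# corpus_bleu expects, for each hypothesis, a list of tokenized references
refs = [[["the", "cat", "sat", "on", "the", "mat"]]]
hyps = [["the", "cat", "is", "on", "the", "mat"]]

# Uniform 4-gram weights (BLEU-4) with NIST geometric-sequence smoothing (method3)
bleu = corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25),
                   smoothing_function=SmoothingFunction().method3)
print(bleu)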
Thanks for the clarification. But still, which metric is ultimately used in the leaderboard?
> which metric is ultimately used in the leaderboard?

Could you please be more specific here? Are you asking how the total points are calculated?
- ROUGE is used for evaluating the KC result; to be clear, we are using rouge-l_f here (see the sketch after this list).
- The metric shown in bold in the picture above is the one we used in the final leaderboard. Please refer to Section 2.3, Contrastive Evaluation System, in the paper for our evaluation method; it is explained clearly there.
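For illustration only, a minimal sketch of computing rouge-l_f with the Python rouge package; the exact implementation in this repository may differ, and the sample strings below are hypothetical:

from rouge import Rouge

# Hypothetical hypothesis/reference pair as plain whitespace-tokenized strings
hyp = "the model summarizes the document"
ref = "the model summarizes this document well"

rouge = Rouge()
scores = rouge.get_scores(hyp, ref)      # one score dict per hypothesis
rouge_l_f = scores[0]["rouge-l"]["f"]    # ROUGE-L F-score, i.e. rouge-l_f
print(rouge_l_f)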
Thank you for your response; it has provided the clarity I was seeking. In particular, your explanation of the use of rouge-l_f for evaluating the KC result, and of the bold characters that mark the chosen metrics in the final leaderboard, is very helpful.
In light of the importance of these details, I would like to propose that this information be included in the official documentation. This addition would significantly enhance the comprehension of the evaluation system for all readers and users, allowing them to better understand the process outlined in section 2.3.
Once again, I truly appreciate the information you have provided, and I believe that incorporating these details in the documentation would be a valuable step forward.
Which metric are you ultimately evaluating for these datasets: F1 or EM, and ROUGE or BLEU? I could not tell from reading the paper. Also, BLEU has many variants; which variant do you use in the implementation?