nishthajain1611 opened 1 year ago
Hi @nishthajain1611
Thank you for your interest in our work.
Your understanding is correct. We use the micro-averaged F1 score, as in Yadav et al., so the metric itself is the same. The difference lies in how the ground truth and predictions are represented: we do not post-process to remove duplicate (tag, phrase) pairs, which retains the realistic setting of the task. Note also that the two papers use different datasets, so the scores are not directly comparable as is.
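To make the difference concrete, here is a minimal sketch of the two conventions (this is not the actual slue-toolkit code; the function and the toy data below are just for illustration). Both compute micro-averaged F1 over (tag, phrase) pairs; the only difference is whether duplicates are collapsed before matching.

```python
from collections import Counter

def micro_f1(gt_pairs, pred_pairs, dedup=False):
    """Micro-averaged F1 over (tag, phrase) pairs.

    dedup=False keeps duplicates (multiset matching, our setting);
    dedup=True collapses duplicates first (as described by Yadav et al.).
    """
    if dedup:
        gt_pairs, pred_pairs = set(gt_pairs), set(pred_pairs)
    gt, pred = Counter(gt_pairs), Counter(pred_pairs)
    # True positives: per-pair overlap of counts between prediction and ground truth.
    tp = sum((gt & pred).values())
    n_pred, n_gt = sum(pred.values()), sum(gt.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A phrase occurring twice in the reference counts twice without dedup.
gt = [("ORG", "asapp"), ("ORG", "asapp"), ("PER", "yadav")]
pred = [("ORG", "asapp"), ("PER", "yadav")]
print(micro_f1(gt, pred))              # 0.8 -- duplicates retained
print(micro_f1(gt, pred, dedup=True))  # 1.0 -- duplicates removed
```

As the toy example shows, a system that misses a repeated entity mention is penalized in the duplicate-retaining setting but not in the deduplicated one.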
@sshon-asapp, @felixgwu, @apasad-asapp The paper *End-to-end named entity recognition from English speech* by Yadav et al. specifies that duplicate (tag, phrase) pairs are not considered when computing precision and recall.
Your paper *On the Use of External Data for Spoken Named Entity Recognition* states that it uses the F1 measures from Yadav et al., but the evaluation code in slue-toolkit that you use to score the results does not remove duplicates, and so effectively compares (tag, phrase, identifier) triplets for the F1 score.
Could you please clarify which metric was used for the results published in your paper?