dice-group / gerbil

GERBIL - General Entity annotatoR Benchmark
GNU Affero General Public License v3.0

How to calculate Macro F1 QALD Score #320

Closed: jannlemm0913 closed this issue 5 years ago

jannlemm0913 commented 5 years ago

Hello,

we (@silvanknecht and I) are trying to recreate some of the QALD-9 results and are now wondering how the Macro F1 QALD score is calculated. We know that the precision is 1 if the system gives no answer (instead of precision = 0), but that should not matter for the macro F1, as F1 is calculated for every question.

Is the F1 calculated from the global macro precision and recall? Or are we missing some cases where it matters that the precision is 1 even though the recall is 0 and the F1 for that question is therefore 0 as well?

Thanks and kind regards

RicardoUsbeck commented 5 years ago

Hi, thanks for using GERBIL QA. In this paper, we describe it a bit more: http://www.semantic-web-journal.net/system/files/swj1838.pdf (image: excerpt from the paper)

Maybe @TortugaAttack can point you to the code which implements that later.

jannlemm0913 commented 5 years ago

A bit above the text you sent there is this: "For the macro metric, we calculate the precision, recall and F-measure per question and average these metrics individually at the end." Is that different for the Macro F1 QALD?

TortugaAttack commented 5 years ago

Hi, regarding the code:

QALD F1 Macro calculation https://github.com/dice-group/gerbil/blob/4d08ad0438688ef11b1bb68e515c344ba9e235d1/src/main/java/org/aksw/gerbil/evaluate/impl/FMeasureCalculator.java#L158

and the respective single precision, recall and f1 measure calculations: https://github.com/dice-group/gerbil/blob/4d08ad0438688ef11b1bb68e515c344ba9e235d1/src/main/java/org/aksw/gerbil/evaluate/impl/FMeasureCalculator.java#L174

MichaelRoeder commented 5 years ago

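Following the code, it is different: for the QALD macro F1, macro precision and macro recall are calculated first (each averaged over all questions), and the F1 measure is then computed from these two averages rather than by averaging the per-question F1 values.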
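To make the difference concrete, here is a minimal Python sketch. It is not the GERBIL code: the function names and the two sample questions are made up for illustration, and returning 0 when precision and recall are both 0 is an assumption of the sketch. It compares the usual macro F1 (the average of the per-question F1 values) with an F1 computed from macro precision and macro recall, once with an unanswered question counted as precision 1 and once as precision 0.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (assumed to be 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(scores):
    """Usual macro F1: average of the per-question F1 values."""
    return sum(f1(p, r) for p, r in scores) / len(scores)


def qald_macro_f1(scores):
    """F1 computed from macro precision and macro recall."""
    macro_p = sum(p for p, _ in scores) / len(scores)
    macro_r = sum(r for _, r in scores) / len(scores)
    return f1(macro_p, macro_r)


# Hypothetical per-question (precision, recall) pairs.
answered = (1.0, 1.0)       # fully correct answer
unanswered_p1 = (1.0, 0.0)  # no answer, precision counted as 1
unanswered_p0 = (0.0, 0.0)  # no answer, precision counted as 0

with_p1 = [answered, unanswered_p1]
with_p0 = [answered, unanswered_p0]

print(macro_f1(with_p1), macro_f1(with_p0))            # same value for both
print(qald_macro_f1(with_p1), qald_macro_f1(with_p0))  # values differ
```

In this toy case the per-question average is 0.5 either way, because the unanswered question has F1 = 0 under both conventions. The F1 built from the averages is 2/3 when the unanswered question counts as precision 1 and 0.5 when it counts as precision 0, so the precision convention does affect the score once F1 is derived from macro precision and macro recall.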

I would like to emphasize that this is not one of our ideas. It came from earlier QALD challenges where a script was used for the evaluation. We implemented it only for backwards compatibility. (You can see the complete discussion from https://github.com/dice-group/gerbil/issues/211#issuecomment-334168618 on)

jannlemm0913 commented 5 years ago

I see, thanks for the fast reply. We have read the issue but were unsure how the changes were implemented. Now we know and can refer to this issue in our work. Thank you.