dice-group / gerbil

GERBIL - General Entity annotatoR Benchmark
GNU Affero General Public License v3.0

GSinKB A2KB #335

Open mickvanhulst opened 4 years ago

mickvanhulst commented 4 years ago

I am currently evaluating previous work and comparing its reported scores to those produced by the GERBIL platform (to look for discrepancies). Most of the previous work I found that uses the AIDA datasets ignores emerging entities, as no ground truth was available for them. I have now read that in some cases these out-of-KB entities were added to the gold standard, which could cause a discrepancy in the scores, since previous work had no way of comparing against them. For the D2KB task I therefore found it very convenient to be able to report GSinKB scores, as these, as far as I know, ignore such occurrences.

I am now one step further and would like to start evaluating A2KB, but no GSinKB score is available when using the cloud platform you provide. What is the reason for not including GSinKB scores for the A2KB task, and is there a way I could enable them?

MichaelRoeder commented 4 years ago

The GSinKB scores are calculated for the D2KB sub-experiment. However, they are not visible in the UI, since the "main" experiment type (i.e., A2KB in your case) defines which measures are shown.

As the description in our wiki shows, the GSinKB measures were introduced for the D2KB experiment to enable the benchmarking of annotators while ignoring emerging entities. Introducing them for A2KB would be possible. However, it might lead to misleading comparisons: these measures would punish annotators that look for emerging entities in an A2KB experiment, while they do not necessarily do that in a D2KB experiment.

@RicardoUsbeck @TortugaAttack what do you think?

mickvanhulst commented 4 years ago

First of all, thanks for your quick response and continued support!

To build my case a bit further, I'd like to elaborate using a scenario that worries me, based on the example depicted in table 1 of [1] (5th row). In this scenario, my system might find an emerging entity for which the original work was not punished, as the original dataset does not contain out-of-KB ground truth (i.e., it would simply be ignored). During evaluation on GERBIL, however, I am punished: I might find the entity 'Berlin', but since I only know of entities within my Wikipedia KB, I will not be able to predict the correct ground truth. Enabling GSinKB would allow users to compare against most previous work, where none of these emerging entities are added and where most systems solely perform predictions on Wikipedia-based datasets. The extra measures could then be used to describe how well the proposed method generalizes to emerging entities.

[0] Röder, M., Usbeck, R., & Ngonga Ngomo, A.-C. (2018). GERBIL – benchmarking named entity recognition and linking consistently. Semantic Web, 9(5), 605–625.
[1] https://github.com/dice-group/gerbil/wiki/URI-matching

MichaelRoeder commented 4 years ago

I understand your issue. However, the GSinKB measure introduces the problem that it "removes" annotations that might have been created by the annotation system. For D2KB, this does not really cause trouble: since we allow A2KB systems to be benchmarked with D2KB experiments, GERBIL removes annotations that do not match an annotation in the gold standard. However, an A2KB experiment does not work that way, and we would run into problems if we adopted this behavior for A2KB. Let me give you a simple example:

```
GS: <person 1> and <person 2 (EE)> first met in <City>.
A1: |person 1|     |person 2     |              |City|
A2: |person 1|                                  |City|
A3: |person 1|                    |first|       |City|
```

In D2KB, person 2 as well as the wrong annotation "first" created by the third annotator can be ignored without discriminating against any of the annotators. The experiment is only interested in whether the given entities are linked correctly, and narrowing the focus to person 1 and City does not really create an issue. It is arguable whether some D2KB systems behave differently with or without getting the mention of person 2 as input, and it would be much cleaner to reduce the datasets to inKB entities if one is only interested in those. However, as a simple, additional measure, GSinKB does not really hurt anybody in this scenario.

However, A2KB takes the recognition into account. When using a GSinKB measure for A2KB, the question is: what should be done with the remaining annotations that do not match one of the inKB annotations of the gold standard? There are two options that could be implemented easily (both are illustrated in the sketch after this list):

  1. punish the annotator and count them as FPs. This would discriminate against A1, since it found a correct annotation, although the measure is not interested in it.
  2. don't punish the annotator and ignore these annotations. This gives a huge advantage to A3, which annotated a part of the text that is not a named entity with respect to the gold standard (i.e., it discriminates against all other annotators).
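
To make the difference concrete, here is a small scoring sketch in Python. This is not GERBIL's implementation; the mention names, URIs, and the scoring function are invented purely to mirror the example sentence above and the two options:

```python
# Toy scoring sketch (not GERBIL code). Mention names and URIs are made up;
# they only reproduce the GS/A1/A2/A3 example above.

GOLD_IN_KB = {"person 1": "kb:Person1", "City": "kb:City"}  # inKB gold annotations
# "person 2" is the emerging entity (EE) and has no inKB gold link.

ANNOTATORS = {
    "A1": {"person 1": "kb:Person1", "person 2": "kb:SomeGuess", "City": "kb:City"},
    "A2": {"person 1": "kb:Person1", "City": "kb:City"},
    "A3": {"person 1": "kb:Person1", "first": "kb:Wrong", "City": "kb:City"},
}

def gs_in_kb_a2kb(annotations, punish_leftovers):
    """Micro precision/recall with respect to the inKB gold annotations only."""
    tp = sum(1 for m, uri in annotations.items()
             if m in GOLD_IN_KB and GOLD_IN_KB[m] == uri)
    wrong_links = sum(1 for m, uri in annotations.items()
                      if m in GOLD_IN_KB and GOLD_IN_KB[m] != uri)
    # "Leftovers" are annotations that do not match any inKB gold annotation,
    # e.g. A1's (actually correct) EE annotation and A3's spurious "first".
    leftovers = [m for m in annotations if m not in GOLD_IN_KB]
    fp = wrong_links + (len(leftovers) if punish_leftovers else 0)  # option 1 vs. option 2
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(GOLD_IN_KB)
    return precision, recall

for name, ann in ANNOTATORS.items():
    p1, _ = gs_in_kb_a2kb(ann, punish_leftovers=True)   # option 1
    p2, r = gs_in_kb_a2kb(ann, punish_leftovers=False)  # option 2
    print(f"{name}: option 1 P={p1:.2f}, option 2 P={p2:.2f}, R={r:.2f}")
```

Option 1 drops A1 to the same precision as A3 (0.67) although A1's extra annotation is correct with respect to the full gold standard, while option 2 lets A3's wrong "first" annotation go unpunished. In a plain D2KB experiment, where the leftovers are removed anyway, all three annotators end up with P = R = 1.00.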

If I remember correctly, the D2KB sub-experiment handles these cases in a slightly smarter way, but I would have to double-check how it works in detail, which may take some more time. However, maybe it would be simpler to make the GSinKB measure of the sub-experiment visible. @TortugaAttack what do you think?

@mickvanhulst: I have a question regarding the results to which you are comparing your system. If you know

  1. the performance of the annotators (the micro measures) and
  2. the difference between the "old" and the "new" version of the gold standards, i.e., which EE markings have been added, and
  3. you are sure you can assume that the annotators to which you want to compare yourself are not able to find EEs,

shouldn't it be possible to simply calculate the micro recall and (based on that and the original precision) the F1-score they would have achieved? For example, if you have the old micro recall r_o and the number of annotations in the old and in the new gold standard (a_o and a_n), the new recall should be

r_n = (r_o * a_o) / a_n

Or am I missing something? :thinking:
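
For concreteness, a quick worked version of that adjustment (the numbers below are invented; the precision is assumed to stay the same because the system's output itself does not change):

```python
# Invented numbers, purely to illustrate the adjustment sketched above.
p   = 0.80   # old micro precision (unchanged, since the system output is the same)
r_o = 0.75   # old micro recall, measured against the old gold standard
a_o = 400    # number of annotations in the old gold standard
a_n = 450    # number of annotations in the new gold standard (EE markings added)

# If the annotator cannot find EEs, its true positives do not change,
# only the denominator of the recall grows:
r_n  = (r_o * a_o) / a_n             # 0.75 * 400 / 450 = 0.6667
f1_n = 2 * p * r_n / (p + r_n)       # F1 with the old precision: 0.7273

print(f"adjusted recall = {r_n:.4f}, adjusted F1 = {f1_n:.4f}")
```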

mickvanhulst commented 4 years ago

Thank you for your example. I would say that option 1 is not a viable solution, as I don't think we can punish an annotator for finding something for which we do not know whether it is correct or wrong. Option 2 therefore seems the better solution, as it solely looks at the entities of which we have knowledge.

Regarding your questions: I cannot assume 3, because in some cases the entity recognition model is able to find the mentions, but the entity linking system is not able to correctly link them to the knowledge base, which is restricted to DBpedia.