dice-group / gerbil

GERBIL - General Entity annotatoR Benchmark

Reproducibility of WAT results #458

Open flackbash opened 5 days ago

flackbash commented 5 days ago

Dear authors,

First of all, thank you for the great work you do in making entity linking results more comparable.

My question is specifically about GERBIL's WAT annotator: I get different results when selecting WAT as an annotator in the A2KB task than when I use my own NIF API, which simply forwards requests from GERBIL to the official WAT API.

My setup is as follows: I built my own NIF API, which forwards the text GERBIL posts to the WAT API at https://wat.d4science.org/wat/tag/tag. I do not provide any additional parameters to the WAT API. From the WAT result, I extract the span boundaries from the start and end fields and the entity title from the title field. I then create an entity URI as follows (in Python):

from urllib.parse import quote

# wiki_title is the value of the "title" field returned by the WAT API
entity_uri = "http://dbpedia.org/resource/" + quote(wiki_title.replace(" ", "_"))

Then I send the span and the entity URI back to GERBIL.
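
For reference, a condensed sketch of this forwarding logic (simplified; the request parameter names and the "annotations" key are illustrative assumptions, while start, end, and title are the fields I actually use):

import requests
from urllib.parse import quote

WAT_ENDPOINT = "https://wat.d4science.org/wat/tag/tag"

def forward_to_wat(text, api_key):
    # Forward the text posted by GERBIL to the WAT API without additional parameters.
    # The parameter names ("gcube-token", "text") and the "annotations" key are
    # assumptions for illustration; "start", "end", and "title" are the fields used.
    response = requests.get(WAT_ENDPOINT, params={"gcube-token": api_key, "text": text})
    results = []
    for ann in response.json().get("annotations", []):
        wiki_title = ann["title"]
        entity_uri = "http://dbpedia.org/resource/" + quote(wiki_title.replace(" ", "_"))
        # Note: the confidence score returned by WAT is not forwarded in this setup.
        results.append({"start": ann["start"], "end": ann["end"], "uri": entity_uri})
    return results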

The results I get using this approach differ from those I get when simply selecting WAT as the annotator in GERBIL. On KORE50, for example, I get a Micro InKB F1 score of 0.5512 using my NIF API and 0.5781 when selecting WAT as the annotator. See this experiment: http://gerbil.aksw.org/gerbil/experiment?id=202409170001

I was wondering if GERBIL sets any additional parameters in the call to the API or filters the returned entities by score using a threshold. Looking at the GERBIL code, I didn't see any of that though. Can you confirm that GERBIL does not use additional API parameters and does not filter results by score? This would already help me to narrow down the problem.

I just realized that the results for the recognition task are the same, so the problem might be in the URI matching. How exactly does GERBIL create URIs from the Wikipedia titles predicted by WAT?

Any other hints to where this discrepancy could come from are highly appreciated.

Many thanks in advance!

MichaelRoeder commented 5 days ago

Thank you for using GERBIL :slightly_smiling_face:

I hope we can find the difference together :+1:

For A2KB, we send a request to "https://wat.d4science.org/wat/tag/tag". Apart from the document text and our API key, we do not use any additional parameters.

I assume that the difference comes from how we make use of confidence scores. We choose the confidence threshold that gives us the best Micro F1 score. You can find the chosen threshold in the "confidence threshold" column of the results. If you forward the confidence scores, too, you should achieve the same results.
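
Conceptually, the selection works like the following sketch (not the actual GERBIL code; micro_f1 and the "score" key are hypothetical names for a helper that scores the kept annotations against the gold standard):

def best_micro_f1_threshold(annotations, gold, candidate_thresholds):
    # For each candidate threshold, keep only the annotations whose confidence
    # reaches it, compute the Micro F1 score, and remember the best threshold.
    # micro_f1 and the "score" key are hypothetical names used for illustration.
    best_threshold, best_f1 = 0.0, -1.0
    for t in sorted(candidate_thresholds):
        kept = [a for a in annotations if a["score"] >= t]
        f1 = micro_f1(kept, gold)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1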

The received Wikipedia article title is used to directly create a DBpedia IRI. With our sameAs retrieval approach described in our journal paper, we should end up with a set of IRIs including the DBpedia and Wikipedia IRIs.

I hope that this issue didn't consume a lot of your time. Please let us know if you think that the behavior of GERBIL is unreasonable and should be changed or improved. :slightly_smiling_face:

flackbash commented 4 days ago

Thanks a lot for the quick reply! I had so far successfully ignored the confidence threshold column as I rarely scroll that far to the right... Using the reported confidence thresholds, I can now reproduce the results, thank you for the clarification :)
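
(For anyone running into the same thing, the fix amounts to filtering the annotations by the reported threshold, roughly as in the sketch below; "score" and reported_threshold are illustrative names, and whether the comparison should be >= or > is something I have not checked.)

# Keep only annotations whose confidence reaches the threshold reported in
# GERBIL's "confidence threshold" column (illustrative variable names).
kept = [a for a in annotations if a["score"] >= reported_threshold]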

My first intuition is that setting the confidence threshold individually for each benchmark gives systems like WAT, which delegate the task of finding a good confidence threshold to the user, an unfair advantage over other systems. I know that systems like WAT, DBpedia Spotlight, or TagMe present this delegation as a feature and argue that it gives the user more control over precision vs. recall. However, most other linkers could probably output some kind of confidence score, too; instead, they aim at providing a single setting that gives good results on most benchmarks rather than making the user figure out a threshold that works well. In my opinion, setting the confidence threshold individually for each benchmark also does not represent a realistic scenario: a user in a real-world setting will most likely not set a confidence threshold for each piece of text that is processed, and setting the threshold such that the results are optimal would basically require generating a ground truth for the processed text.

I personally don't think it would be unfair to take the results the API outputs as they are, without any filtering at all, since these are the results a user can expect if they don't do any additional tweaking. Right now, the reported results are an upper bound on what a user can expect from the linker (without changing the API parameters).

It's an interesting problem and very relevant for me as I'm currently writing an analysis and comparison of different entity linkers, so I also need to figure out how to best deal with this... I would love to hear your point of view on it!

Again, thank you for the quick reply and clarifications, it really spared me a headache!

MichaelRoeder commented 4 days ago

Yes, I am also slightly unhappy with the way we implemented the comparison. I think that we could offer the user much more information and insight about the confidence scores and their impact on the evaluation scores.

While I agree with your negative points (the results become an upper bound, and the comparison can be seen as unfair since we use our knowledge of the test set's gold standard to find the confidence threshold), I would like to point out that previous works had a "barrier" between systems with and without confidence scores, and we tried to get rid of this separation. I also think that the confidence score is actually a nice additional feature. On the other hand, I understand the argument that a user may not make use of it :wink:

With respect to your comparison of linkers, I guess the main goal has to be the fairness of comparison. There could be different ways to handle it (I do not know the exact context of your work, so my suggestions might be wrong :sweat_smile:):

  1. Ignore confidence scores and compare all systems based on what they provide. That would work but it is quite easy for others to argue against your results.

  2. Make use of confidence scores.

    2.1. Decide what to optimize for: we optimize the Micro F1 score, but you could also go for other Micro, Macro, or weighted average scores.

    2.2. Decide on which data you base the optimization: the main disadvantage, already described above, is that the optimization is done during the evaluation based on the gold standard. In our use case, this is related to the way GERBIL works, but I agree that it is sub-optimal. You could also choose one of the following (better) strategies:

    • Run an evaluation on the training data and find the threshold based on these results. This is the "classic" approach, but it assumes that you differentiate between training and test datasets and that you have at least one training dataset for each test dataset.
    • Run the evaluation similarly to a cross-validation, i.e., gather the evaluation data for all datasets. Then choose dataset 1, take it out, determine the best threshold based on all other datasets, and get the evaluation results for dataset 1 by applying that threshold to the evaluation data you gathered for this dataset. Repeat the same strategy for every other dataset (a minimal sketch of this strategy follows the list below).

    2.3. Make your criticism of confidence scores a point of your work, i.e., analyze to what extent the evaluation results vary based on the confidence scores and how they are chosen. For example, compare the upper bound calculated by GERBIL to the evaluation values that you got.
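
A rough sketch of the leave-one-dataset-out strategy from 2.2 (assuming the hypothetical helpers best_micro_f1_threshold and micro_f1 from the sketch above; this is not GERBIL code):

def leave_one_dataset_out(eval_data, candidate_thresholds):
    # eval_data maps each dataset name to its (annotations, gold) pair.
    # For every dataset: pool all *other* datasets, pick the threshold that
    # maximizes their Micro F1, and apply that threshold to the held-out dataset.
    results = {}
    for held_out in eval_data:
        pooled_annotations, pooled_gold = [], []
        for name, (annotations, gold) in eval_data.items():
            if name != held_out:
                pooled_annotations.extend(annotations)
                pooled_gold.extend(gold)
        threshold, _ = best_micro_f1_threshold(pooled_annotations, pooled_gold,
                                               candidate_thresholds)
        annotations, gold = eval_data[held_out]
        kept = [a for a in annotations if a["score"] >= threshold]
        results[held_out] = micro_f1(kept, gold)
    return results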

Your work sounds very interesting and I would like to know more about it. Feel free to write me a mail if you have questions or if you would like to discuss how we could support your work.