Can you point to an example? Ideally, it should be handled.
In terms of the implementation, if it encounters an example without annotations, it treats it as if the system should have predicted nothing.
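As a rough sketch of that policy (my paraphrase, not the actual evaluation code): an example with an empty gold set scores 1.0 only if the prediction is also empty, and 0.0 otherwise.

```python
def example_f1(gold, predicted):
    """Per-example F1 over sets of entity IDs (sketch only).

    If the gold set is empty, the system is expected to predict nothing:
    an empty prediction counts as fully correct, anything else as wrong.
    """
    gold, predicted = set(gold), set(predicted)
    if not gold:
        return 1.0 if not predicted else 0.0
    if not predicted:
        return 0.0
    tp = len(gold & predicted)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```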
For example, in the WebQuestions test set we have the following:
{"question_id": "WebQTest-3", "utterance": "who plays ken barlow in coronation street?", "entities_fb": ["m.015lwh", "m.01_2n"], "entities": [null, "Q945030"], "main_entity_text": "Coronation Street", "main_entity": "Q945030", "main_entity_fb": "m.01_2n", "main_entity_tokens": "coronation street", "main_entity_pos": [24, 41], "entity_classes": [null, "product"]}
Note the null in entities. I think such entries should be skipped during the F1 calculation, with no penalty whatsoever.
And then we also have {"question_id": "WebQTest-521", "utterance": "who was anakin skywalker?", "entities_fb": [], "entities": []
which is a different case, but you already explained how you handle that one.
I also wanted to know: what kind of text search do you use for the n-grams? Do you use edit distance? And how many candidates do you consider per n-gram? I ask because I am getting low recall when keeping the top 30 from Elasticsearch.
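To make my setup concrete, this is roughly the kind of per-n-gram query I mean (the host, index, and field names here are just placeholders):

```python
import requests

# Placeholder host/index/field names; adjust to your own Elasticsearch setup.
ES_SEARCH_URL = "http://localhost:9200/entity_labels/_search"

def candidates_for_ngram(ngram, size=30):
    """Fetch the top `size` label matches for one n-gram."""
    query = {
        "size": size,
        "query": {
            "match": {
                "label": {
                    "query": ngram,
                    "fuzziness": "AUTO",  # allow a small edit distance
                }
            }
        },
    }
    resp = requests.post(ES_SEARCH_URL, json=query).json()
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```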
The first case is really a problem of the mapping from Freebase (FB) to Wikidata: for some FB entities there was simply no corresponding information, so we kept them as null. For the F1 calculation I think we actually included them, because we compared against some systems that use FB instead of Wikidata. This did put our system at a disadvantage, of course.
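As a sketch of the two options (not the actual evaluation code): dropping the nulls before scoring versus keeping them as gold items that can never be matched, which is what lowers recall for a Wikidata-based system.

```python
def gold_entities(example, skip_nulls=False):
    """Collect the gold Wikidata IDs from a dataset entry (sketch only).

    skip_nulls=True drops the unmapped FB entities before scoring;
    skip_nulls=False keeps them in the gold set, where a system that
    predicts Wikidata IDs can never match them, so its recall drops.
    """
    entities = example.get("entities", [])
    return [e for e in entities if e is not None] if skip_nulls else entities
```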
For the text search we used the CONTAINS method of Virtuoso, which is arguably not the best option. It checks whether there is an entity label in the database that contains the search query. I experimented with edit distance at some point, but it introduced more noise than useful information. We kept 50 candidates.
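Roughly, the lookup is of this form (a sketch only; the endpoint URL and the exact query in the project may differ):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed local Virtuoso endpoint; the real endpoint is deployment-specific.
ENDPOINT = "http://localhost:8890/sparql"

def label_candidates(ngram, limit=50):
    """Retrieve up to `limit` entities whose label contains the n-gram."""
    # bif:contains expects the phrase wrapped in single quotes inside the literal.
    phrase = "'" + ngram.replace("'", " ") + "'"
    query = f"""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?e ?label WHERE {{
        ?e rdfs:label ?label .
        ?label bif:contains "{phrase}" .
        FILTER(lang(?label) = "en")
    }} LIMIT {limit}
    """
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [(b["e"]["value"], b["label"]["value"]) for b in bindings]
```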
Ok thanks a lot, closing this issue.
Hi,
In the test datasets there are some sentences which have no entities, or null entities. How do you handle this during evaluation?