Add TAC-KBP Experiments

dice-group / gerbil

GERBIL - General Entity annotatoR Benchmark

GNU Affero General Public License v3.0


Add TAC-KBP Experiments #8

Closed: MichaelRoeder closed this issue 10 years ago

MichaelRoeder commented 10 years ago

Add experiment types from http://nlp.cs.rpi.edu/kbp/2014/KBP2014EL_V0.2.pdf

RicardoUsbeck commented 10 years ago

Make sure that the licence is not violated when the data is transformed to the annotation backend. Keep in mind that we only want open data and open source software.

giusepperizzo commented 10 years ago

Similarly to #16: what does "open data" mean?

giusepperizzo commented 10 years ago

TAC KBP 2014 scorer: https://github.com/wikilinks/neleval

I propose to use the scorer as it is in the official repository and to format both the gold standard (GS) and the system outputs to fit the scorer's expected inputs.

A few aspects to consider: an entity is defined as an ordered list of the following features: doc_id, startOffset, endOffset, uri, salience, type

For the majority of the systems supported in GERBIL (and in NERD), doc_id, start and end offset, and uri are available. The type is a different matter: for Babelfy, for instance, we would have to retrieve it from a Wikipedia page or a similar source (am I mistaken?). The same holds for the salience score.
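
A minimal sketch of that record, assuming Python and a tab-separated serialization. The class name `Annotation`, the helper `to_row`, and the empty-string placeholder for missing fields are hypothetical; the exact input format the scorer expects should be checked against its documentation.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    """One entity mention, with fields in the order listed above."""
    doc_id: str
    start_offset: int
    end_offset: int
    uri: str
    # Not every annotator provides these two, so they are optional here.
    salience: Optional[float] = None
    entity_type: Optional[str] = None


def to_row(a: Annotation, missing: str = "") -> str:
    """Serialize an annotation as one tab-separated line (assumed format)."""
    fields = [
        a.doc_id,
        str(a.start_offset),
        str(a.end_offset),
        a.uri,
        missing if a.salience is None else str(a.salience),
        missing if a.entity_type is None else a.entity_type,
    ]
    return "\t".join(fields)
```

Keeping salience and type optional makes it explicit which systems cannot fill them, instead of hiding the gap behind a default value.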

RicardoUsbeck commented 10 years ago

Please refine what you mean by experiment types, matchings, and evaluation measures, and open new, separate issues for each of them.

RicardoUsbeck commented 10 years ago

See #48 and #49. @giusepperizzo Are there more experiment types we have to cover?

rtroncy commented 10 years ago

#48 is about Typing (the R in NER)
#49 is about Salience (the importance of a NE in a text?)
Obviously, you already have an experiment about Linking (the L in NEL) ... is there a GitHub issue for this?
You may want to have an experiment for evaluating the Detection (the D in EDL), i.e. getting the correct surface form of an entity, in particular when there are nested entities

RicardoUsbeck commented 10 years ago

Just to clarify:

From our point of view, typing is a function that assigns rdf:type to each entity from a given set
Salience means pointing out the most important (linked) entity in a text
Linking itself is called D2W (Disambiguation to Wikipedia) and is already included in the bat-framework
The detection or annotation experiment itself is also scheduled for milestone 2 #50 :)
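
To make the first two definitions concrete, here is a minimal sketch in Python. It is an illustration only, not GERBIL code: the function names `assign_types` and `most_salient`, the dictionary-based inputs, and the choice of a set of types per entity are all assumptions.

```python
from typing import Dict, Iterable, Set


def assign_types(entities: Iterable[str],
                 known_types: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Typing: map each entity URI in the given set to its rdf:type values.

    A set is used so that an entity may carry zero, one, or several types.
    """
    return {uri: set(known_types.get(uri, set())) for uri in entities}


def most_salient(salience: Dict[str, float]) -> str:
    """Salience: return the entity URI with the highest score.

    Assumes a non-empty mapping from entity URI to a numeric score.
    """
    return max(salience, key=salience.get)
```

How the scores and the type inventory are obtained is left open here; the sketch only fixes the shape of the inputs and outputs.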

rtroncy commented 10 years ago

Right, but the world is complex.

Typing: ok ... but a typical NER system provides more than one type for an entity, so this is a function that associates several rdf:type values with an entity, right? Do you say anything about which taxonomy is used (e.g. schema.org, YAGO, DBpedia-OWL, etc.) OR about alignments between taxonomies (e.g. NERD ontology)?
Salience: the most important entity? Or the most important entities? And how do you define (and evaluate) this importance?
Linking: why do you restrict it to D2W? And by the way, which Wikipedia? There are many (localized) editions of it!
Detection: call it detection (and not annotation)

RicardoUsbeck commented 10 years ago

It would be better to move the definition to the paper and the discussion to the mailing list.
