Investigate update of datasets #170

Open · RicardoUsbeck opened this issue 7 years ago

RicardoUsbeck commented 7 years ago

There are updated versions of the ACE2004, AIDA-CoNLL, AQUAINT, and MSNBC datasets (http://bit.ly/2gnSBLg) from the following publication: www.semantic-web-journal.net/content/robust-named-entity-disambiguation-random-walks-0

Investigate whether we can/should replace GERBIL's current datasets with these.

octavian-ganea commented 7 years ago

They also built a new dataset from ClueWeb with 11,154 mentions, which is the biggest dataset for entity linking to the best of my knowledge.

MichaelRoeder commented 7 years ago

From my point of view, "replacing" is a bad idea, especially with respect to the repeatability of experiments. Adding them and marking the old datasets as deprecated (or something like that) might be a better way.

However, before integrating these new datasets, we should make sure that a) they differ substantially from the original datasets (especially after GERBIL's sameAs retrieval) and b) they do not introduce new errors (i.e., validate their quality with EAGLET).
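For point a), a rough first pass could diff the two versions' annotations after normalizing URIs onto one namespace, loosely mimicking the sameAs retrieval. A minimal sketch, assuming both versions have been exported to a simple tab-separated format (document ID, mention offsets, entity URI); the file names and the TSV layout are assumptions for illustration, not GERBIL's actual NIF handling:

```python
# Hypothetical sketch: compare an original and an updated dataset version
# to gauge how much they differ. Assumes a simplified TSV export
# (doc_id, start, end, uri) rather than GERBIL's native NIF format.
import csv

def load_annotations(path):
    """Load (doc, start, end, uri) tuples, crudely normalizing URIs the way
    the sameAs retrieval would map them onto one namespace."""
    annotations = set()
    with open(path, newline="", encoding="utf-8") as f:
        for doc, start, end, uri in csv.reader(f, delimiter="\t"):
            # crude normalization: treat Wikipedia and DBpedia URIs alike
            uri = uri.replace("https://en.wikipedia.org/wiki/",
                              "http://dbpedia.org/resource/")
            annotations.add((doc, int(start), int(end), uri))
    return annotations

old = load_annotations("msnbc_original.tsv")   # placeholder file names
new = load_annotations("msnbc_updated.tsv")

print(f"unchanged:        {len(old & new)}")
print(f"only in original: {len(old - new)}")
print(f"only in updated:  {len(new - old)}")
```

If the "only in" counts are small, the update may not be worth a separate entry; a large gap would argue for keeping both versions side by side.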

Just my two cents :wink:

RicardoUsbeck commented 7 years ago

I think Michael is right: if they cannot be matched easily and the difference is too large, we should keep both versions, marking the original as "deprecated/original" and the other as "improved", with a link to the original paper/citation.
In terms of size, I think the Wikilinks corpus is larger, but it is automatically annotated. My problem with too-large datasets is that we do not have infinite server capacity, and GERBIL already uses a big part of a 32-core, 128 GB server. My hope is that in the future we can migrate to the bigger HOBBIT platform (http://project-hobbit.eu/) for special datasets and tasks over Big Data volumes.
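To make the proposed "deprecated vs. improved" labelling concrete, it could be as little as a mapping from dataset name to status and citation. This is purely illustrative; the names, status values, and fields are assumptions, not anything in GERBIL's actual configuration:

```python
# Illustrative only: one way to encode the versioning scheme discussed above.
DATASET_REGISTRY = {
    "MSNBC": {
        "status": "deprecated/original",
        "successor": "MSNBC (updated)",   # hypothetical successor name
    },
    "MSNBC (updated)": {
        "status": "improved",
        "citation": "www.semantic-web-journal.net/content/"
                    "robust-named-entity-disambiguation-random-walks-0",
    },
}
```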