anuzzolese / oke-challenge


Issue on overlap entities in the task-1 training set #8

Open jplu opened 9 years ago

jplu commented 9 years ago

Hi,

I found another bug in the training set. This one concerns two overlapping entities:

<http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=12,28>
        a                     nif:String , nif:RFC5147String ;
        nif:anchorOf          "Auburn, New York"@en ;
        nif:beginIndex        "12"^^xsd:int ;
        nif:endIndex          "28"^^xsd:int ;
        nif:referenceContext  <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=0,145> ;
        itsrdf:taIdentRef     <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/Auburn,_New_York> .

And

<http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=2,27>
        a                     nif:String , nif:RFC5147String ;
        nif:anchorOf          "native of Auburn, New Yor"@en ;
        nif:beginIndex        "2"^^xsd:int ;
        nif:endIndex          "27"^^xsd:int ;
        nif:referenceContext  <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/sentence-15#char=0,145> ;
        itsrdf:taIdentRef     <http://www.ontologydesignpatterns.org/data/oke-challenge/task-1/Native_of_Auburn,_New_York_1> .

I think the second one is incorrect.
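To make the overlap concrete, here is a minimal sketch that classifies two character spans as disjoint, nested, or crossing. It assumes `nif:beginIndex` is inclusive and `nif:endIndex` is exclusive, which matches the anchor lengths above; the names `Span` and `classify_overlap` are made up for illustration and are not part of any OKE or NIF tooling.

```python
from typing import Tuple

# (begin, end) character offsets; end is assumed to be exclusive.
Span = Tuple[int, int]


def classify_overlap(a: Span, b: Span) -> str:
    """Return 'disjoint', 'nested', or 'crossing' for two character spans."""
    (a_begin, a_end), (b_begin, b_end) = a, b
    # Half-open spans share a character only if each starts before the other ends.
    if not (a_begin < b_end and b_begin < a_end):
        return "disjoint"
    a_in_b = b_begin <= a_begin and a_end <= b_end
    b_in_a = a_begin <= b_begin and b_end <= a_end
    return "nested" if (a_in_b or b_in_a) else "crossing"


auburn = (12, 28)  # "Auburn, New York"
native = (2, 27)   # "native of Auburn, New Yor"
print(classify_overlap(auburn, native))  # prints "crossing"
```

Under these assumptions the two spans cross rather than nest: the second one ends at offset 27, one character before the end of the first.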

Cheers.

anuzzolese commented 9 years ago

Hi Julien,

this overlap identifies two different entities:

This distinction is correct, hence the overlap is correct as well.

giusepperizzo commented 9 years ago

Well, if so, why not tag/link New_York as well?

Would you mind detailing a bit more how you handled nested entities when creating the gold standard (GS)?

rtroncy commented 9 years ago

@anuzzolese Can we please re-open this issue? This is serious, since nested entities are a very _hard_ problem for the community. It is fine that the organizers of the challenge want to consider them, but then you need to communicate the guidelines that were given to the annotators. For example, @giusepperizzo just gave you a case showing that not all possible nested entities have been annotated. Next, you need to guarantee that the guidelines are applied consistently across the training and the test sets.

Warning: you are really opening a can of worms by considering nested entities. You are likely to have a long adjudication phase in which all the systems that participated in the challenge come back, complain, and ask for the figures to be recomputed because they have discovered inconsistencies. Are you sure you want this?

anuzzolese commented 9 years ago

@giusepperizzo and @rtroncy I see your point and I agree that overlapping entities are very hard to handle.

I asked the annotators to report each distinct entity when mentions overlap. In this case the annotator found two distinct entities and considered New_York a characterisation (a way of disambiguating) of Auburn. However, the comment is highly pertinent, and this way of generating entities might open that can of worms in the evaluation. In fact, someone could say that New York is a mention of another entity.

Hence, in my opinion there are two possibilities:

The issue is reopened. WDYT?

rtroncy commented 9 years ago

Thanks for re-opening the issue. For the purposes of the challenge, I think you should go for your second option, i.e. remove all annotations of nested entities, in both the training and test datasets, and only consider the "largest" entity (often the one with the longest surface form).
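As a rough illustration of the "largest entity only" rule, the sketch below greedily keeps the longest span in each group of overlapping mentions. The greedy strategy, the half-open offsets, and the name `keep_largest` are assumptions made for this example, not the procedure used by the organizers.

```python
from typing import List, Tuple

# (begin, end) character offsets; end is assumed to be exclusive.
Span = Tuple[int, int]


def keep_largest(spans: List[Span]) -> List[Span]:
    """Greedily keep the longest spans and drop any span overlapping a kept one."""
    kept: List[Span] = []
    # Visit the longest spans first; ties go to the span that starts earlier.
    for begin, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # Keep the span only if it is disjoint from every span kept so far.
        if all(end <= k_begin or k_end <= begin for k_begin, k_end in kept):
            kept.append((begin, end))
    return sorted(kept)


# The two sentence-15 annotations quoted earlier in this issue.
print(keep_largest([(12, 28), (2, 27)]))  # prints [(2, 27)]
```

On the two sentence-15 annotations this keeps the (2, 27) span, so the rule by itself does not repair a mention whose boundaries are wrong.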

Annotating the dataset with nested entities is also a very valuable effort and, if you're willing to do it, it could be of great benefit to the community. This resource would be useful after the challenge for performing additional experiments. For example, TAC 2014 treated nested entities as optional (for the systems that wanted to run some trials), but this was not part of the official competition, since the community is still learning how this complex problem should be scored and evaluated.

jplu commented 9 years ago

Following up on this issue, there are two more cases: