bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.

Subscore labels in evaluation #64

Closed by flipz357 5 months ago

flipz357 commented 5 months ago

Hi,

there is some evaluation code in amrlib for reporting fine-grained scores, and I find the labeling a bit confusing. For instance, "Named Entities" only checks for ":name" relations.

This AMR

(r / walk-01
    :arg0 (p / person
         :name (n / name
               :op1 "Barack"
               :op2 "Obama")))

and this AMR

(r / walk-01
    :arg0 (p / person
         :name (n / name
               :op1 "Hillary"
               :op2 "Clinton")))

get a score of 100.00 (maximum score) with amrlib version 0.7.1 in the category "Named Entities":

Named Ent.       -> P: 1.000,  R: 1.000,  F: 1.000

This seems confusing and doesn't feel right: "Barack Obama" and "Hillary Clinton" are not the same named entity...
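
To make concrete what the subscore actually compares, here is a minimal sketch using the penman library. This is only an illustration of the behaviour described above (as far as I understand it), not the actual amrlib evaluation code: once the comparison is restricted to ":name" edges, the name strings in :op1/:op2 never enter it.

    # Illustration only, not the amrlib evaluation code: what is left of the two
    # AMRs once only ":name" edges are considered.
    import penman

    obama = '''
    (r / walk-01
        :arg0 (p / person
             :name (n / name
                   :op1 "Barack"
                   :op2 "Obama")))
    '''

    clinton = '''
    (r / walk-01
        :arg0 (p / person
             :name (n / name
                   :op1 "Hillary"
                   :op2 "Clinton")))
    '''

    def name_edge_concepts(amr_str):
        g = penman.decode(amr_str)
        concept = {v: c for v, _, c in g.instances()}   # variable -> concept
        # Keep, for each :name edge, only the concepts at its two endpoints;
        # the :op1/:op2 name strings are never looked at.
        return [(concept[s], concept[t]) for s, r, t in g.edges() if r == ':name']

    print(name_edge_concepts(obama))    # [('person', 'name')]
    print(name_edge_concepts(clinton))  # [('person', 'name')]
    # Identical output for both graphs, hence P = R = F = 1.0 for this subscore.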

I also think this is not originally a bug in amrlib, but rather an issue inherited from another repository (which probably isn't maintained anymore), since the code in amrlib seems to be copied from there.

Maybe it would simply help to rename the categories so they indicate more precisely what they actually measure ("Named Entity" --> "name edge").

flipz357 commented 5 months ago

Yeah, I see that once again one may argue that this is not a problem of amrlib but rather a general issue with evaluation practice in the field. So I'm going to close this; maybe it still helps someone who is struggling to understand the "named entity" score.

nschneid commented 5 months ago

I am guessing the code for subscores originated from Damonte et al. 2017's evaluation. You are certainly right that enhanced smatch is not great as a general semantic similarity metric—I think it presupposed that it was comparing two parses of the same sentence, and that generally names would be aligned if they have the same strings, so the named entity subscore focuses on whether the entity types match.
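
For what it's worth, here is a hedged sketch (not the actual amrlib or Damonte et al. code) of how a subscore line like the one above can come out as 1.000 across the board: once each AMR is reduced to a bag of extracted items (e.g. entity-type pairs), precision, recall and F1 are computed over those items, and the name strings no longer play any role.

    # Toy precision/recall/F1 over extracted items; illustration only.
    from collections import Counter

    def prf(predicted, gold):
        pred_c, gold_c = Counter(predicted), Counter(gold)
        matched = sum((pred_c & gold_c).values())        # multiset overlap
        p = matched / sum(pred_c.values()) if pred_c else 0.0
        r = matched / sum(gold_c.values()) if gold_c else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Both AMRs from the issue reduce to the same single item for this subscore:
    print(prf([('person', 'name')], [('person', 'name')]))   # (1.0, 1.0, 1.0)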

flipz357 commented 5 months ago

Thank you, the parsing argument makes sense since it's a restricted scenario where I see that you could make some stronger assumptions (but even here "named entities" may be misleading, since the parser may parse the type correctly but the name wrongly, which then isn't accounted for). I also have a similar headache with other labels like the "SRL" label... I guess it is at least debatable whether "SRL" in AMR is only the :ARGx triples, as in the sketch below.
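
Just to make the "SRL" point concrete, here is my own toy sketch (not amrlib's code) of what such a subscore reduces to if it is restricted to :ARGx edges (ignoring inverse -of roles and other details): only predicate/role/argument-concept triples like the one below are compared.

    # Toy illustration of an "SRL"-style restriction to :ARGx edges.
    import re
    import penman

    amr = '''
    (r / walk-01
        :arg0 (p / person
             :name (n / name
                   :op1 "Barack"
                   :op2 "Obama")))
    '''

    def srl_triples(amr_str):
        g = penman.decode(amr_str)
        concept = {v: c for v, _, c in g.instances()}    # variable -> concept
        return [(concept[s], role.lower(), concept[t])
                for s, role, t in g.edges()
                if re.fullmatch(r':arg\d+', role, flags=re.IGNORECASE)]

    print(srl_triples(amr))   # [('walk-01', ':arg0', 'person')]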

I once started implementing some finer-grained measures, but they're still a bit buggy and I haven't worked on them in a while (one idea was to set a sensible subgraph extraction range for different aspects and also to use AMR concept classes to measure similarity with respect to many more meaning aspects that AMR can capture -- events, time, location, ...).
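
Purely as an illustration of the general "subgraph extraction" idea (my own sketch, not the implementation mentioned above): for an aspect such as :location or :time, one could collect the whole subgraph below the corresponding edges and compare those subgraphs, rather than a single edge.

    # Toy sketch: extract the subgraph reachable below a given role, e.g. :location.
    from collections import defaultdict
    import penman

    amr = '''
    (w / walk-01
        :arg0 (p / person :name (n / name :op1 "Barack" :op2 "Obama"))
        :location (c / city :name (n2 / name :op1 "Berlin")))
    '''

    def aspect_subgraph(amr_str, role):
        g = penman.decode(amr_str)
        children = defaultdict(list)
        for s, r, t in g.triples:
            if r != ':instance':
                children[s].append((r, t))
        # Start from the targets of the chosen role and walk downwards.
        frontier = [t for s, r, t in g.triples if r == role]
        seen, subgraph = set(frontier), []
        while frontier:
            node = frontier.pop()
            for r, t in children.get(node, []):
                subgraph.append((node, r, t))
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
        return subgraph

    print(aspect_subgraph(amr, ':location'))
    # [('c', ':name', 'n2'), ('n2', ':op1', '"Berlin"')]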

If the labels of the current subscores were changed a bit to be more precise (e.g., "Named entities" -> "entity types" or ":name edges"), I think this could avoid some confusion.