bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.
MIT License

Smatch Scoring Discrepancies #60

Closed bjascob closed 7 months ago

bjascob commented 7 months ago

The AMRBART project noted in this issue that smatch scores from amrlib are 0.2 to 0.3 points higher than those reported by amr-evaluation-tool-enhanced.

Scores from amrlib, amr-evaluation-tool-enhanced, and the original snowblink smatch library should be compared for differences and for variation between runs. All three libraries rely on the "snowblink" version of smatch for their overall score, so they should match within the natural variation of that method.
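
For anyone reproducing the comparison, here is a minimal scoring sketch using amrlib's bundled scorer. Caveat: the module path and helper names (`amrlib.evaluate.smatch_enhanced`, `get_entries`, `compute_smatch`) are written from memory and the file names are placeholders, so verify them against your installed amrlib before relying on the exact API.

```python
# Hedged sketch: scoring a parsed file against gold with amrlib's scorer.
# The import path and helpers below are assumptions -- check the evaluate
# module of your amrlib install. File names are placeholders.
from amrlib.evaluate.smatch_enhanced import get_entries, compute_smatch

test_entries = get_entries("model_parsed.txt")    # model output, one AMR per block
gold_entries = get_entries("gold_reference.txt")  # gold AMRs in the same order

precision, recall, f_score = compute_smatch(test_entries, gold_entries)
print(f"Precision: {precision:.5f}, Recall: {recall:.5f}, F-score: {f_score:.5f}")
```

Running the same two files through snowblink14/smatch and through amr-evaluation-tool-enhanced should then expose any systematic offset between the three.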

nschneid commented 7 months ago

There's also Smatch++, which uses an ILP solver instead of hill-climbing, so it is supposed to be more accurate. https://github.com/flipz357/smatchpp by @flipz357

bjascob commented 7 months ago

I looked into this a bit, and it looks like amrlib is consistent with the latest snowblink14/smatch code (both the 1.0.4 release and the current master). ChunchuanLv/amr-evaluation-tool-enhanced consistently gives scores about 0.3 points lower (as noted above) on the 2 models I tested.

All of these libs use snowblink14/smatch under the hood for the algorithm, but ChunchuanLv/amr-evaluation-tool-enhanced uses an older python2 version that is very hacked up. I'm guessing that later code from snowblink14 (which is now python3) fixed something. However, I didn't dig through their issues list or try to regression test the code to find out what changed.

In short, I suspect the higher numbers are the "right" scores.

BTW... I tried running the test multiple times and you only see variation in the 4th digit, i.e.:

Precision: 0.83471, Recall: 0.81098, F-score: 0.82267
Precision: 0.83476, Recall: 0.81103, F-score: 0.82272
Precision: 0.83483, Recall: 0.81109, F-score: 0.82279

So it's unlikely you'd see variation when only printing 3 digits, unless you happen to be right on the rounding boundary.
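
As a sanity check on those three runs: the printed F-score is just the harmonic mean of the printed precision and recall, so fourth-decimal jitter in P and R produces roughly fourth-decimal jitter in F. A quick check in plain Python, using the numbers above:

```python
# Consistency check: F1 is the harmonic mean of precision and recall.
runs = [(0.83471, 0.81098), (0.83476, 0.81103), (0.83483, 0.81109)]
for p, r in runs:
    f = 2 * p * r / (p + r)
    print(f"P={p:.5f} R={r:.5f} -> F={f:.5f}")
# Reproduces 0.82267, 0.82272, 0.82279 -- the values reported above.
```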

flipz357 commented 7 months ago

Hi,

I think I can explain some of it.

The lower score from the old(er) Smatch script is mainly due to a penalty when the root concept is wrong. It turns out that this penalty is actually desired, since the root concept is the focus of an AMR. See this recent issue for more context and explanation.

So, no, the higher numbers are not the right scores.

To sum up simply:

- Higher numbers due to optimal solving (Smatch++): the higher numbers are right.
- Higher numbers due to ignoring the AMR root concept: the higher numbers are wrong.
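
To make the root-concept penalty concrete, here is a small self-contained illustration (my own sketch; encoding the root as an extra TOP triple is an assumption for this toy, not smatch's exact internal representation). The two graphs are identical except that the test graph picks the wrong root concept: counting the TOP triple lowers the F-score, while ignoring it inflates the score to a perfect 1.0.

```python
# Toy illustration (not smatch internals): effect of counting vs. ignoring
# the root (TOP) triple when the root concept is wrong.

def f_score(matched, n_test, n_gold):
    p, r = matched / n_test, matched / n_gold
    return 2 * p * r / (p + r) if p + r else 0.0

# Instance/role triples identical in gold and test.
shared = {("instance", "w", "want-01"), ("instance", "b", "boy"),
          ("instance", "g", "go-02"), ("ARG0", "w", "b"),
          ("ARG1", "w", "g"), ("ARG0", "g", "b")}
gold = shared | {("TOP", "w", "want-01")}   # assumed root encoding: want-01 is the focus
test = shared | {("TOP", "g", "go-02")}     # parser roots the graph at go-02 instead

print("counting the root triple:  F = %.3f" % f_score(len(gold & test), len(test), len(gold)))  # 0.857
print("ignoring the root triple:  F = %.3f" % f_score(len(shared), len(shared), len(shared)))   # 1.000
```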

flipz357 commented 7 months ago

A bit more on the variation

The differences:

Precision: 0.83471, Recall: 0.81098, F-score: 0.82267
Precision: 0.83476, Recall: 0.81103, F-score: 0.82272
Precision: 0.83483, Recall: 0.81109, F-score: 0.82279

seem very small, and one might think they're not worth worrying about (I used to think this too), but the issue is that the true Smatch score may not actually be close to these numbers. These numbers are the result of a heuristic: on one run it may solve some graph pairs better, and on another run other graph pairs, so on average the variation looks small. But if you solved all graph pairs optimally, the true Smatch score could be different.
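
To illustrate the difference between a heuristic and an optimal alignment, here is a self-contained toy (my own sketch, not smatch's or Smatch++'s code): it aligns the variables of a small test graph to a gold graph once by exhaustive search (the true optimum, which an ILP solver would also find) and once by hill-climbing with a few random restarts. The hill-climbing count can never exceed the optimum, and on harder pairs it can fall short, which is why corpus-level averages from the heuristic can drift from the true Smatch score.

```python
# Toy sketch: optimal vs. hill-climbing alignment of test variables to gold
# variables, counting matched (relation, source, target) triples.
import random
from itertools import permutations

gold = {("instance", "g0", "want-01"), ("instance", "g1", "boy"),
        ("instance", "g2", "go-02"), ("ARG0", "g0", "g1"),
        ("ARG1", "g0", "g2"), ("ARG0", "g2", "g1")}
test = {("instance", "t0", "want-01"), ("instance", "t1", "boy"),
        ("instance", "t2", "go-02"), ("ARG0", "t0", "t1"),
        ("ARG1", "t0", "t2"), ("ARG0", "t2", "t1")}
gold_vars = sorted({a for _, a, _ in gold})   # variables appear as triple sources
test_vars = sorted({a for _, a, _ in test})

def n_matches(mapping):
    rename = lambda x: mapping.get(x, x)
    return len({(r, rename(a), rename(b)) for r, a, b in test} & gold)

# Optimal alignment: try every bijection (an ILP solver finds the same optimum).
optimal = max(n_matches(dict(zip(test_vars, p))) for p in permutations(gold_vars))

# Heuristic alignment: pairwise-swap hill-climbing from a random start.
def hill_climb(seed):
    order = gold_vars[:]
    random.Random(seed).shuffle(order)
    mapping = dict(zip(test_vars, order))
    score, improved = n_matches(mapping), True
    while improved:
        improved = False
        for i in range(len(test_vars)):
            for j in range(i + 1, len(test_vars)):
                a, b = test_vars[i], test_vars[j]
                mapping[a], mapping[b] = mapping[b], mapping[a]
                new = n_matches(mapping)
                if new > score:
                    score, improved = new, True
                else:                       # swap did not help, revert it
                    mapping[a], mapping[b] = mapping[b], mapping[a]
    return score

heuristic = max(hill_climb(s) for s in range(4))   # a few random restarts
print(f"optimal matches: {optimal}, hill-climbing matches: {heuristic}")
```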

flipz357 commented 7 months ago

Hi @bjascob, since the issue is closed as completed, I wonder if you have found other sources for the differences than the ones I summarized here: https://github.com/goodbai-nlp/AMRBART/issues/14#issuecomment-1832226440

Having worked 4-5 years on AMR/graph metrics, I'm quite interested in this, and it's one of the few topics where I feel I can make a halfway informed comment...