random IAA for same graph, random alignments

flipz357 commented 5 months ago

Hi, thanks for this project!

are you interested in adding smatchpp as the main solver? How it's currently setup, I'm afraid all parts of scripts that rely on Smatch can provide wrong and random results, since they use Smatch hillclimbing.

Here's a simple example. I use this file (took it from Bram Vanroy who filed similar issue in the snowblink14/smatch repo): test.txt

And call

python inter_annotator.py --files test.txt test.txt --runs $n

E.g. this is an example result:

Smatch F1: 90.00 differences: 15.0000

Doesn't matter if $n is 1, or 5, or 15. I get wrong and random result. The true Smatch IAA for the same input file of course would be 100 Smatch F1, and there would be zero differences (not 15).

Maybe you know some other way to fix this issue, but I think it's clearly due to using heuristic solver.

flipz357 commented 3 months ago

@jheinecke ping in case you don't get notified by issues in this repo

jheinecke commented 3 months ago

Hi, I should have got the notification (since I'm the main contributor) but I didn't get it, strange enough. So thanks for pinging me again. Nice example where smatch fails ! I'm very much interested to add smatchpp. I'll probably have to add your repo as submodule and use your API. I'll contact you (on smatchpp) in case of problems. Which method in which file provides the API to compare two AMR graphs?

flipz357 commented 3 months ago

Cool! The main access point / API is in bindings.py.

You could also simply use it with Goodmami's popular Penman reader. Maybe this is then exactly what you might want: Reading the graphs with Goodmami's Penman reader (The native Penman reader of smatchpp is also more accurate than the reader of original smatch, but it is not as thoroughly tested), and then matching the graphs with ILP optimal solver.

from smatchpp import Smatchpp, solvers, data_helpers
graph_reader = data_helpers.GoodmamiPenmanReader()
ilp = solvers.ILP()
measure = Smatchpp(graph_reader=graph_reader, alignmentsolver=ilp)
match, optimization_status, alignment = measure.process_pair("(t / test)", "(t / test)")
print(match)

A way to simply feed triples directly into Smatchpp is explained here.

Note that the code variant above also doesn't use a standardizer, which means that the graphs

(d / Duck)

and

(d / duck)

will be counted as different. I feel that for parsing evaluation the graphs should be considered as the same, but for IAA maybe you want to be really strict and point at all differences, even if they are possibly neglectable.

jheinecke commented 3 months ago

I think it's working. But I need another information from the comparing process (for the --compare option of the editor). In order to highlight differing concepts / edges when visually comparing two graphs Smatch (the modification I did) returns a list of all nodes and edges which are OK in each graph. I guess the information I need is somewhere hidden in the alignment but I cannot figure out how to get the information I need. For example. If I compare these graphs

(s / dose-01
  :ARG0 ( c / cat))

and

(s / sleep-01
  :ARG0 ( k1 / cat)
  :location ( k /kitchen))

My Smatch also returns

for first graph: correct instances {'c'} correced edges {('s', ':ARG0', 'c'), ('s', ':TOP', 'top'), ('c', ':ARG0-of', 's')}
for the second graph: instances {'k1'} and edges {('s', ':TOP', 'top'), ('k1', ':ARG0-of', 's'), ('s', ':ARG0', 'k1')}

Is there any way to access this information ?

flipz357 commented 3 months ago

Ah yes, that's a good point. I think there's several way to get this. I might consider writing a convenience function for this, but I think currently a way would be to reconstruct the information using an alignment.

I have an example of how to get the alignment here. It should work, I think I tested it recently. With this alignment, it would be possible to rather simply reconstruct the desired information.

Just thinking about, maybe it can be a tad misleading to speak of "correct instances/triples". These instances/triples are only correct given a specific alignment. While smatchpp does give the optimal alignment, there may be cases where there are different options of optimal alignment (>1 global optimum). In this case what we consider as "correct instance/triple" could change given an alternative alignment. In other words, counts and scores that smatchpp calculates are invariant and deterministic, but the explicit mappings/alignment could be different is these cases.

jheinecke commented 3 months ago

Finally I played with your example in Example VIII: get an alignment, but whether I use

 measure = Smatchpp(graph_reader=graph_reader, alignmentsolver=ilp)

or

 measure = Smatchpp()

print(interpretable_mapping) prints either [[('aa_x_test', 'None_None')]] or [[('aa_x_test', 'bb_y_test')]]

flipz357 commented 3 months ago

The correct output is the second: [[('aa_x_test', 'bb_y_test')]]

It means that the variable 'x' from the first graph ('aa'), which is an instance of 'test' maps to the variable 'y' which is an instance of 'test' from the second graph ('bb'). The 'aa'/'bb' and mentioning the concepts of the variables is just sugar that I added in an attempt to make it a bit more human readable.

I can't reproduce the [[('aa_x_test', 'None_None')]] at the moment, this might be a bug, I am not sure.

jheinecke commented 3 months ago

I found my bug concerning the (None, None) but there is a difference with and without the ILP solver:

def test(withilp=False):
    graph_reader = data_helpers.GoodmamiPenmanReader()
    if withilp:
        ilp = solvers.ILP()
        measure = Smatchpp(graph_reader=graph_reader, alignmentsolver=ilp)
    else:
        measure = Smatchpp(graph_reader=graph_reader)

    s1 = "(s / eat-01  :ARG0 ( c / cat))"
    s2 = "(s / sleep-01  :ARG0 ( k1 / cat)   :location ( k /kitchen))"

    g1 = measure.graph_reader.string2graph(s1)
    g1 = measure.graph_standardizer.standardize(g1)
    g2 = measure.graph_reader.string2graph(s2)
    g2 = measure.graph_standardizer.standardize(g2)
    g1, g2, v1, v2 = measure.graph_pair_preparer.prepare_get_vars(g1, g2)

    alignment, var_index, _ = measure.graph_aligner.align(g1, g2, v1, v2)
    var_map = measure.graph_aligner._get_var_map(alignment, var_index)
    interpretable_mapping = measure.graph_aligner._interpretable_mapping(var_map, g1, g2)
    print("MAPPING", withilp, interpretable_mapping)

    score = measure.score_pair(s1, s2)
    print("SCORE", score)

test() outputs:

MAPPING False [('aa_c_cat', 'bb_k1_cat'), ('aa_s_eat-01', 'bb_s_sleep-01'), ('None_None', 'bb_k_kitchen')]
SCORE {'main': {'F1': np.float64(60.0), 'Precision': np.float64(75.0), 'Recall': np.float64(50.0)}}

and test(True) returns

MAPPING True [('aa_c_cat', 'bb_k1_cat'), ('aa_s_eat-01', 'bb_s_sleep-01'), ('None_None', 'None_None')]
SCORE {'main': {'F1': np.float64(60.0), 'Precision': np.float64(75.0), 'Recall': np.float64(50.0)}}

Why is the last mapping absent when using the KP solver?

flipz357 commented 3 months ago

I think it's equivalent. The first output is a dummy alignment for "kitchen", while the second lets "kitchen" remain unaligned. As can be seen in your example, both options lead to the same Smatch. Recall that for any two Graphs there exists one correct Smatch score, but possibly multiple different alignments. For more "uniformity", it could be hard-coded that if there is a node not aligned then always return a certain format, i.e., the first or the second variant of your output (dummy alignment vs. not show up).

jheinecke commented 3 months ago

OK, I see, but I have an example of two very similar graphs, and using Solver. ILC or not changes the alignment:

(s / sleep-01
  :ARG0 ( k1 / cat)
  :location ( k /kitchen))

(s / sleep-01
  :ARG0 ( k1 / dog)
  :destination ( t /table))

If I understand correctly the Solver.ILP is not able to match the nodes k and t, since I get [('aa_k1_dog', 'bb_k1_cat'), ('aa_s_sleep-01', 'bb_s_sleep-01'), ('aa_t_table', 'None_None')] when using measure = Smatchpp(graph_reader=graph_reader, alignmentsolver=ilp) but only [('aa_k1_dog', 'bb_k1_cat'), ('aa_s_sleep-01', 'bb_s_sleep-01'), ('aa_t_table', 'None_None')] when using measure = Smatchpp(graph_reader=graph_reader) and this causes my problem in adding smatchpp to my compare mode

flipz357 commented 3 months ago

The outputs you show are the exact same?

[('aa_k1_dog', 'bb_k1_cat'), ('aa_s_sleep-01', 'bb_s_sleep-01'), ('aa_t_table', 'None_None')]

Why would you expect that k and t are matched? From "semantic" view I can see where you are coming from, but there is no Smatch gain in matching the nodes, so it doesn't matter if they're matched or not. To make a semantically more sensible alignment, a different measure/metric would be needed that sees the similarity between kitchen and table, and location and destination.

jheinecke commented 3 months ago

Sorry, copy and error. Should have been [('aa_k1_dog', 'bb_k1_cat'), ('aa_s_sleep-01', 'bb_s_sleep-01'), ('aa_t_kitchen', 'bb_k_kitchen')] in the second case. I expected that with or without ILP, I get the same result for a simple graph like this

flipz357 commented 3 months ago

I think you don't mean

[('aa_k1_dog', 'bb_k1_cat'), ('aa_s_sleep-01', 'bb_s_sleep-01'), ('aa_t_kitchen', 'bb_k_kitchen')]

but rather

[('aa_k1_dog', 'bb_k1_cat'), ('aa_s_sleep-01', 'bb_s_sleep-01'), ('aa_t_table', 'bb_k_kitchen].

Correct?

Anyway, we need to be precise here:

The result is the exact same. The result is the Smatch score.
The alignment is superficially different yes, but equivalent. It's a bit like one person saying 1 + 1 = 2 and another says 1 + 1 + 0 = 2. Two different ways, same result. Sorry for the slightly contrived example.

An example how this happens could be e.g., the ILP starts from everything unaligned. In the end, it finds the maximum Smatch score and ignores all variables that don't need be aligned (since aligning them doesn't increase Smatch). The hill-climber, however, starts from the situation that everything already is aligned (but randomly). Having reached the local optimum, there's no reason to remove any alignments, even if removing them would not lower the Smatch score.

So in the end, ILP and Hillclimber may return different alignments but result in the same score. (Same solution, different approach). Only that the ILP solution is verifiable and has optimality guarantee, whereas with the HillClimber we would never know if there isn't a higher Smatch score possible.

jheinecke commented 3 months ago

Hi, I've just pushed version 3.4 which integrates SmatchPP in both, inter_annotator.py and the editor itself (--compare ...). In both cases you need to use option -S. Have a look !

flipz357 commented 3 months ago

Cool! Maybe it can be helpful to add a bit of information in the Readme, on what's the difference.

Hill-climber smatch (default) is more efficient (is it? I just assume it is), but can provide wrong results.
Smatchpp provides verifiable and optimal results.

So if you want to use hill-climber as default, even if can return very wrong results (see opening post on top), maybe you could justify its default status with faster computation speed? Or because fewer extras need be installed?

Adding this information could help users that want to ensure that the calculated IAA is actually the real IAA.

jheinecke commented 3 months ago

Good idea to complete the doc this way. Thanks! I'll keep the old version default, to avoid unexpected changes with people using it and it is much faster, and, as you say, to keep needed installs at a minimum.

Orange-OpenSource / metamorphosed

random IAA for same graph, random alignments #13