delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

Exhaustive MRS comparison needed for faster debugging #315

Open olzama opened 4 years ago

olzama commented 4 years ago

I think the current MRS comparison is optimized for deciding as fast as possible whether two MRSs are isomorphic (returning False as soon as possible). But this means that when a Matrix regression test fails because of an actual improvement, the two MRSs need to be compared by hand in full before the test can be updated with confidence that the only diff is in fact the improvement and there aren't any other diffs. There are sometimes reorderings due to a change in an append, etc., which makes manual comparison very time-consuming.

A function which would make an exhaustive comparison and then return specific information on where the two MRSs do not match would be incredibly useful. For example, I currently have over 50 tests which I need to inspect and update; I think they all contain just one simple difference, but I cannot be sure, so I have to manually inspect every one, and they all feature reorderings in HCONS which I don't care about. In principle, someone may make a change in the future which affects all 500+ tests similarly. (I may look into writing this myself at some point... still debating what will take me more time: just doing these 50 tests or trying to write the function :) ).

olzama commented 4 years ago

I started working on this, and for now I just have this (but, assuming it is working correctly, this is already helpful):

NOTE: parsed 7 / 16 sentences, avg 563k, time 0.03696s

[comparing to gold]
  Current profile: /Users/olzama/Research/matrix-git/tests/regression/home/current/ccomp-illustr2-tur
  Gold profile: /Users/olzama/Research/matrix-git/tests/regression/home/gold/ccomp-illustr2-tur
  1                                        <0,1,0>
  3                                        <0,1,0>
  7                                        <1,0,1>
  8                                        <1,0,1>
  10                                       <1,0,1>
  13                                       <1,0,1>
  14                                       <0,1,0>

 ******** Messages from MRS comparison *********

7: Length of HCONS, ICONS, or variables is different.

8: Length of HCONS, ICONS, or variables is different.

10: Length of HCONS, ICONS, or variables is different.

13: Length of HCONS, ICONS, or variables is different.

Result: FAIL

But I am not sure this is the best way; it's just something I did quickly. What I did was copy the _vf2_feasible function and modify the copy into _vf2_feasible_exhaustive, which does not return until it has made all the checks; if any check fails, it simply records a message saying which check failed. That is of course not the same as actually showing which parts of the MRSs differ! So I am not making a pull request or anything at this stage; just documenting here what I am doing.
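
The gist of the change is something like this (a simplified sketch, not the real code: the actual _vf2_feasible has a different signature, and the (name, check) pairs here are placeholders for the individual tests):

    def _vf2_feasible_exhaustive(v1, v2, checks, messages):
        # Like _vf2_feasible, but runs every check instead of
        # returning on the first failure, and records which
        # checks failed between the two candidate nodes.
        feasible = True
        for name, check in checks:
            if not check(v1, v2):
                messages.append(
                    '{} comparison failed between {} and {}.'.format(name, v1, v2))
                feasible = False
        return feasible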

olzama commented 4 years ago

The above example is saying that the only check that failed was the "simple tests" check for the lengths of ICONS and such. If any other check had also failed, there would be a message saying something like "Consistency comparison failed". This doesn't help me find which items differ, but at least if there is no such message, perhaps I can assume that all the other checks succeeded? I am not 100% sure of this, actually, particularly whether all the other checks will even work correctly if the lengths check fails. If that's not the case, then this will be harder to do than I thought.

olzama commented 4 years ago

I tested this a bit and am now sure that what I have does not work correctly. For example, I changed transitive-lex-item so that it does not actually link ARG1 and ARG2, but the test still passes :).

olzama commented 4 years ago

Ah, that was a very simple bug. So now I get somewhat uninformative messages which nonetheless indicate that something failed while comparing two predications, rather than when comparing the length of the predication list:

 ******** Messages from MRS comparison *********
1: None comparison failed.
Length comparison failed.

olzama commented 4 years ago

And it looks like I can easily make it somewhat more(?) informative, like this:

******** Messages from MRS comparison *********

1: Length comparison failed between e2 and e2.
None comparison failed between h0 and e2.
Length comparison failed between h0 and e2.
None comparison failed between h1 and e2.
Length comparison failed between h1 and e2.
None comparison failed between h10 and e2.
Length comparison failed between h10 and e2.
None comparison failed between h11 and e2.
Length comparison failed between h11 and e2.
None comparison failed between h12 and e2.
Length comparison failed between h12 and e2.
None comparison failed between h4 and e2.
Length comparison failed between h4 and e2.
None comparison failed between h5 and e2.
Length comparison failed between h5 and e2.
None comparison failed between h6 and e2.
Length comparison failed between h6 and e2.
None comparison failed between h7 and e2.
Length comparison failed between h7 and e2.
None comparison failed between h8 and e2.
Length comparison failed between h8 and e2.
None comparison failed between q3 and e2.
None comparison failed between q9 and e2.
None comparison failed between x3 and e2.
Length comparison failed between x3 and e2.
None comparison failed between x9 and e2.
Length comparison failed between x9 and e2.

So I suppose I could easily detect from looking at this that nothing matched e2 in one of the MRSs (the output should probably indicate which MRS has it as e2; in this case it is both, but that does not have to be the case). But perhaps it does tell me at this point that I do not need to check anything but e2? Or does it?

Anyway, I am far from sure that this is what needs to be done :).

olzama commented 4 years ago

Nah, I think that's as uninformative as before... e2 just seems to be some kind of top structure.

olzama commented 4 years ago

Or maybe it does work, kind of. (By the way, I hope it is OK that I am talking to myself here?)

Below, I broke something specifically in a noun relation, keeping the main event intact. And while it ultimately goes back to comparing the main event against everything, it does start by notifying me that something is wrong with x3. So perhaps I can rely on this somewhat (though not fully).

 ******** Messages from MRS comparison *********

1: None comparison failed between x3 and x3.
None comparison failed between x9 and x3.
Consistency comparison failed between x9 and x3.
None comparison failed between x3 and h0.
Length comparison failed between x3 and h0.
Consistency comparison failed between x3 and h0.
None comparison failed between x9 and h0.
Length comparison failed between x9 and h0.
Consistency comparison failed between x9 and h0.
None comparison failed between x3 and h1.
Length comparison failed between x3 and h1.
Consistency comparison failed between x3 and h1.
None comparison failed between x9 and h1.
Length comparison failed between x9 and h1.
Consistency comparison failed between x9 and h1.
None comparison failed between h0 and e2.
Length comparison failed between h0 and e2.
None comparison failed between h1 and e2.
Length comparison failed between h1 and e2.
None comparison failed between h10 and e2.
Length comparison failed between h10 and e2.
None comparison failed between h11 and e2.
Length comparison failed between h11 and e2.
None comparison failed between h12 and e2.
Length comparison failed between h12 and e2.
None comparison failed between h4 and e2.
Length comparison failed between h4 and e2.
None comparison failed between h5 and e2.
Length comparison failed between h5 and e2.
None comparison failed between h6 and e2.
Length comparison failed between h6 and e2.
None comparison failed between h7 and e2.
Length comparison failed between h7 and e2.
None comparison failed between h8 and e2.
Length comparison failed between h8 and e2.
None comparison failed between q3 and e2.
None comparison failed between q9 and e2.
None comparison failed between x3 and e2.
None comparison failed between x9 and e2.

goodmami commented 4 years ago

I've been a bit too busy to look at this properly. So to recap, you want one or more of the following, I guess?

  1. to print why the isomorphism check fails
  2. to find all differences found by the isomorphism check
  3. to discover differences between two MRSs

(1) is certainly doable, but I'd use logging so it can be disabled. But I wouldn't want to log node comparisons inside the recursive function (see (2)).

(2) is not feasible because the algorithm compares just about any two nodes, so there are lots of uninformative differences. Furthermore, the logging overhead would slow things down considerably, even if logging didn't print anything. What you probably want is to detect differences after it finds the largest isomorphic subgraph. The VF2 algorithm used by PyDelphin can perhaps be configured to do this, but Woodley and Stephan (IIRC) expressed skepticism that this would be tractable.

(3) seems to be a better goal, since it's not tied to the isomorphism check in PyDelphin. For this, we could look to EDM or smatch. EDM would require token mapping, which I don't think the Matrix grammars are configured for. Smatch currently requires PENMAN input, but PyDelphin can make the conversion. For both EDM and smatch, though, the MRS must be converted to a dependency representation (EDS or DMRS), which means that it must be a well-formed MRS, or nearly so. Alternatively, we could find some way to serialize the MRS with deterministically relabeled variables and sorted EPs/HCONS/ICONS/properties so that wdiff can better visualize the diffs between them.
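
For the smatch route, the conversion would be roughly this (a sketch; the module and codec names are from current PyDelphin, and the toy MRS is made up):

    from delphin.codecs import simplemrs, dmrspenman
    from delphin import dmrs

    # a toy, well-formed MRS (for something like "it rains")
    m = simplemrs.decode(
        '[ TOP: h0 INDEX: e2'
        '  RELS: < [ _rain_v_1<3:9> LBL: h1 ARG0: e2 ] >'
        '  HCONS: < h0 qeq h1 > ]')

    d = dmrs.from_mrs(m)                   # needs a (nearly) well-formed MRS
    print(dmrspenman.encode(d, indent=2))  # PENMAN output, usable as smatch input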

olzama commented 4 years ago

(3) is the only thing I really need, with the caveat that if one mechanism is used to compare MRSs and report an rtest failure and a different one to find out the exact details of the difference, the two may be inconsistent. So ideally it would be a single mechanism, but from what I saw in the code that rtest uses, and from what you are saying above, it looks like that might not be straightforward.

I think that this:

Alternatively, we could find some way to serialize the MRS with deterministically relabeled variables and sorted EPs/HCONS/ICONS/properties so that wdiff can better visualize the diffs between them.

might be the most realistic thing to do...

A separate tool using a dependency representation would have to be neatly integrated into the rtest pipeline somehow in order for it to actually speed things up. (If I need to extract the MRS from the test, convert it, feed it to the tool, and visualize the tool's output, it may still be worth it compared to staring at the MRSs, but that remains to be seen, I guess.) And with regression tests, there is no guarantee we have a well-formed MRS...

olzama commented 4 years ago

To talk a bit more about (1) and (2), though: do I understand right that what's intractable, using VF2, is in fact distinguishing informative diffs from uninformative ones?

goodmami commented 4 years ago

Yes, isomorphism detection in general is NP-complete, but the VF2 algorithm works to cut down the search space so the check runs in reasonable time, along with assumptions we can make about our particular kinds of graphs. If we can exit early (e.g., on a different length of the RELS list, etc.), this speeds it up further. If "informative" means only the non-matching part once the largest isomorphic subgraph has been found, that might be doable, but it should be an alternative to the current behavior because of its potential to slow things down.
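
The early-exit part is conceptually just this (a sketch in terms of the public MRS attributes, not the actual code):

    def cheap_mismatch(m1, m2):
        # if any of these lengths differ, the MRSs cannot be
        # isomorphic and the expensive search can be skipped
        return (len(m1.rels) != len(m2.rels)
                or len(m1.hcons) != len(m2.hcons)
                or len(m1.icons or []) != len(m2.icons or []))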

olzama commented 4 years ago

Gotcha; that's what I thought. In that case, I think ideally we would have a faster rtest and a slower rtest. If the faster one returns a failure for a particular test, there would be an option to run the slower algorithm on that test, and the slower algorithm would return not only the failure but also the detailed comparison. Something like that...

olzama commented 4 years ago

Honestly, if we could somehow sort deterministically, that would already be a tremendous help... But I don't know what this means for the existing Matrix tests; the 511 tests we have right now have already been created, so they would probably have to be recreated using the new sorting procedure.

But at the end of the day, if there is determinism in the order of the MRSs and of the predications inside each MRS (both things are important), then using any string-diff tool becomes easy even if the variables are named differently, because you can immediately tell whether the shape of the MRS is the same or not. (Ideally the naming would also be deterministic, but I think sorting is more important.)
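
Something like this rough sketch (untested; the sort keys are just a first guess) is what I have in mind:

    from delphin.codecs import simplemrs

    def normalize_order(m):
        # sort EPs and HCONS deterministically so that pure
        # reorderings disappear from a textual diff
        m.rels.sort(key=lambda ep: (ep.predicate, ep.label,
                                    sorted(ep.args.items())))
        m.hcons.sort(key=lambda hc: (hc.hi, hc.lo))
        return simplemrs.encode(m, indent=True)

    # then compare normalize_order(gold) and normalize_order(current)
    # with wdiff or similar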

Here's a typical situation: the second and the third MRSs give me trouble simply because they got reordered. I could reorder them in the file manually and re-compare, but that of course slows things down quite a bit.

[Screenshot: a list of MRSs where the second and third differ from gold only in the order of predications]

goodmami commented 4 years ago

If it's just for the "slow" testing situation, we can just reorder both the gold and the current MRSs for comparison purposes. I think the EP order won't change often (maybe if the gold was parsed with the LKB and the current with ACE). If that is an issue, I've detailed node-ordering criteria in section 5.4.1 of my dissertation, but without reasonably accurate CFROM/CTO values the result wouldn't be very intuitive.

I think we can improve things just by recreating the variables based on their order of appearance, and maybe also by sorting the variable properties.
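
E.g., at the string level it could be as simple as this (an untested sketch; a real version should work on the MRS object so it cannot accidentally rewrite predicate names):

    import re

    def relabel_vars(mrs_str):
        # renumber variables (h0, e2, x3, ...) in a SimpleMRS string
        # by order of first appearance, so two MRSs that differ only
        # in variable naming come out identical
        mapping = {}
        def repl(match):
            var = match.group(0)
            if var not in mapping:
                sort = re.match(r'[a-z]+', var).group(0)
                mapping[var] = '{}{}'.format(sort, len(mapping))
            return mapping[var]
        # naive: assumes no predicate or CARG token looks like a variable
        return re.sub(r'\b[a-z]+\d+\b', repl, mrs_str)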