coli-saar / am-parser

Modular implementation of an AM dependency parser in AllenNLP.
Apache License 2.0

Extend mtool to convert MRP -> AMR #2

Open alexanderkoller opened 5 years ago

alexanderkoller commented 5 years ago

In order to use the old AMR preprocessing pipeline with the MRP data, the easiest solution would be to simply convert the MRP data to AMRBank format. Let's be good citizens and add the code to do this to the official mtool tool.

The way to do this is as follows:

weissenh commented 5 years ago

A first version of the conversion is implemented. However, it requires the input to already contain inverted edges (e.g. ARG1-of instead of ARG1) where necessary. For the WSJ AMR sample data as well as the MRP training data this seems to be the case, so the currently implemented simple depth-first search will suffice.
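The depth-first serialization described above could be sketched roughly like this (a minimal illustration only; the data layout and function name are hypothetical, not the actual mtool code):

```python
def to_penman(node, concepts, edges, visited=None):
    """Depth-first Penman serialization of a rooted graph.

    concepts: dict mapping variable -> concept, e.g. {"w": "want-01"}
    edges:    dict mapping variable -> list of (role, child) pairs,
              already inverted (x-of) where needed so that a DFS from
              the root can reach every node.
    """
    if visited is None:
        visited = set()
    if node in visited:
        return node  # re-entrancy: emit just the variable
    visited.add(node)
    parts = [f"({node} / {concepts[node]}"]
    for role, child in edges.get(node, []):
        parts.append(f" :{role} {to_penman(child, concepts, edges, visited)}")
    return "".join(parts) + ")"
```

For example, a graph for "the boy wants to go" with a re-entrant ARG0 serializes to `(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))`. Without the pre-inverted edges the DFS can fail to reach some nodes, which is exactly the limitation mentioned above.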

Why I haven't requested a pull yet:

namednil commented 5 years ago

Unfortunately, the conversion is not lossless :( Things like "quant" are sometimes used as properties and sometimes as edges, so turning them into edges and then making them properties again doesn't work, because there is ambiguity. See for instance the graph PROXY_AFP_ENG_20020416_0331.17.

namednil commented 5 years ago

So, I'd suggest adding a suffix, e.g. _prop, to those edge labels that are meant to be interpreted as properties... We then have to adapt the decomposition code to take the new edge labels into account, though.
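A minimal sketch of the suffix idea (illustrative only; the function names are made up):

```python
SUFFIX = "_prop"

def property_to_edge(role):
    """When writing Penman, mark a property role so it can be told apart
    from a genuine edge with the same label (e.g. quant -> quant_prop)."""
    return role + SUFFIX

def edge_to_property(role):
    """When reading the Penman back, undo the marking.
    Returns (bare_role, is_property)."""
    if role.endswith(SUFFIX):
        return role[: -len(SUFFIX)], True
    return role, False
```

This makes the round trip unambiguous: `quant_prop` always goes back to a property, plain `quant` always stays an edge.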

alexanderkoller commented 5 years ago

Good catch. Before we invest in this, can we quantify how much this hurts us?

namednil commented 5 years ago

Not in isolation. With all pre- and postprocessing, when I compare a random subset of the decomposable graphs against their gold graphs, I get the following MRP scores:

```json
{"n": 1491,
 "exact": 1487,
 "tops": {"g": 1491, "s": 1491, "c": 1490, "p": 0.9993293091884641, "r": 0.9993293091884641, "f": 0.9993293091884641},
 "labels": {"g": 14765, "s": 14431, "c": 14234, "p": 0.9863488323747488, "r": 0.9640365729766339, "f": 0.9750650774078641},
 "properties": {"g": 2412, "s": 2723, "c": 2237, "p": 0.821520381931693, "r": 0.9274461028192371, "f": 0.8712755598831547},
 "anchors": {"g": 0, "s": 0, "c": 0, "p": 0.0, "r": 0.0, "f": 0.0},
 "edges": {"g": 14718, "s": 13692, "c": 13599, "p": 0.9932077125328659, "r": 0.9239706481858948, "f": 0.9573389651531151},
 "attributes": {"g": 0, "s": 0, "c": 0, "p": 0.0, "r": 0.0, "f": 0.0},
 "all": {"g": 33386, "s": 32337, "c": 31560, "p": 0.9759717970127099, "r": 0.9453064158629365, "f": 0.9603943824840618},
 "time": 216.625990977}
```
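For readers unfamiliar with the report format: the p/r/f values follow directly from the g (gold), s (system) and c (correct) counts. A small sketch of that arithmetic (not the actual mtool scorer):

```python
def prf(g, s, c):
    """Precision, recall and F1 from gold (g), system (s) and correct (c)
    counts; empty categories (like "anchors" above) come out as 0.0."""
    p = c / s if s else 0.0
    r = c / g if g else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, `prf(14765, 14431, 14234)` reproduces the "labels" row above.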

Other sources of information loss are:

alexanderkoller commented 5 years ago

But from your data it looks like we actually do quite badly on the "properties" score, which could be explained by the lossiness of the conversion, right? So this would indeed be worth fixing.

namednil commented 5 years ago

Yes, that's probably true. The negative effect of the relabeling on the property score is probably very minor.

alexanderkoller commented 5 years ago

I just factored this into a separate issue #25 and took the liberty of assigning Ms. Weißenhorn to it.

alexanderkoller commented 5 years ago

Given that the AMR conversion itself is done, shall we relabel this issue as post-mrp-cleanup and postpone the actual pull request until after the deadline?

weissenh commented 5 years ago

Hmm, I wonder how mtool was able to convert AMRs to MRPs correctly without special _prop suffixes for properties. It seems like mtool can detect properties in a normal Penman graph, so I wonder whether we can use the same strategy and avoid adding special suffixes to some edges.

namednil commented 5 years ago

I think they distinguish between nodes that have names and nodes that don't (at least, that's my reading). Nodes without names probably become properties, and nodes with names become actual nodes in the MRP graph. However, I'm not completely sure the annotators would agree with this interpretation of the AMR Penman string, because it has weird consequences (or this is an annotation error...?):

bolt-eng-DF-170-181106-8873009_0024.7: "as 7 belong to Bush and 1 to Obama."


In any case, this distinction doesn't work for us because the Alto code doesn't know properties and works with graphs that have nodes, edges (and sources), nothing more.
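The named/unnamed distinction could be approximated like this (a rough heuristic sketch, not mtool's actual rule):

```python
def is_constant(target):
    """Guess whether a Penman edge target is a constant (which would
    become an MRP property value) rather than a variable naming a node.
    Heuristic: quoted strings, numbers and the bare polarity symbols
    count as constants; everything else is treated as a variable."""
    if target in ("-", "+"):
        return True
    if target.startswith('"') and target.endswith('"'):
        return True
    try:
        float(target)
        return True
    except ValueError:
        return False
```

Under this reading, `:quant 7` is a property while `:quant (m / many)` is a real edge, which is exactly where the ambiguity in the back-conversion comes from.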

weissenh commented 5 years ago

Maybe the bolt-eng-DF-170-181106-8873009_0024.7 in the dfb part of the training corpus is an annotation error. In the MRP file, the ARG0 7 and ARG0 1 are said to be property-value pairs, although I agree with you that this is somewhat weird. Anyway, I guess that when Alto reads in AMR graphs it needs to create a new node for the value of each property. If these newly created nodes have special names (explicitanon...), maybe you can identify such special nodes and convert them back to property-value pairs for the MRP output? If that's too complicated, I can modify the AMR write method for now (if you want mtool to also read such AMR, I would also need to modify the read method, and that's going to be more complicated, I guess).
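If Alto's generated nodes really do carry a recognizable name prefix (the thread truncates it as explicitanon...), the back-conversion might look like this hypothetical sketch (the exact prefix and function name are assumptions):

```python
ANON_PREFIX = "explicitanon"  # assumed prefix of Alto-generated value nodes

def maybe_property(node_name, concept, incoming_role):
    """If the node looks like an Alto-generated value node, return the
    (property, value) pair it should become in the MRP output;
    otherwise return None and keep it as a real graph node."""
    if node_name.startswith(ANON_PREFIX):
        return incoming_role, concept
    return None
```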

Regarding postponing the pull request for mtool until after the deadline (@alexanderkoller ): yes, I would like to postpone it. One reason is that the AMR write method currently won't work for graphs that don't already contain inverted edges (x-of) where necessary for a simple depth-first search to succeed. Since all graphs in the MRP training data contain such inverted edges where needed, the AMR write method works for now, but I would like to eliminate this limitation before sending a pull request. Also, if we change the code to add suffixes to properties as @namednil suggests, this could be used as an argument against accepting a pull request, because :polarity_prop or :quant_prop are not among the usual AMR edges/properties.