bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.
MIT License

[Feature request] Alignments: AMR <-> AMR and AMR <-> Sentence #3

Closed flipz357 closed 4 years ago

flipz357 commented 4 years ago

Hi,

first, I want to say: You deserve a medal for creating this library. It is the first time I installed an AMR parser without getting a little headache :-). Also it's a nice idea to wrap the noRECAT variant of GSII and ablate all external java preprocessing. I think the noRECAT version may also be more robust.

I have two suggestions that I think would be cool to have in amrlib:

  1. AMR2sent alignment: As far as I know, there exist aligners (for instance as pre-processing of JAMR parser), that align AMR nodes to tokens. Since often lemmas of the sentence are projected into the AMR graph, a simple string match, maybe with some additional rules, could make up a first solid method. Maybe there are other methods that are more suitable and also easy-to-use.

  2. AMR2AMR variable alignment: This could be useful, e.g., for computing AMR metrics, enabling sentence retrieval via AMR-parsed corpora, or computing sentence similarity via AMR. It is an NP-hard problem but can be approximated via hill climbing that maximizes triple matching. I have been working on this lately, here is a repo containing AMR metrics (Smatch and S2match) that are based on this alignment. Both alignments should be quite easy to implement in the lib, since it's all native python. (It could be worthwhile, though, to make the alignment faster, e.g., using cython, since it can be very slow for graphs with many variables.)
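
To make suggestion 2 concrete, here is a minimal sketch of hill climbing over variable mappings. It is a simplified illustration, not the Smatch or S2match implementation: graphs are assumed to be given as plain (source, relation, target) triples plus a list of their variables, and the names `count_matches`, `neighbours` and `hill_climb` are hypothetical.

```python
import random

def count_matches(triples_a, vars_a, triples_b, mapping):
    """Count triples of graph A found in graph B after renaming A's variables."""
    set_b = set(triples_b)
    def rename(x):
        # Variables are renamed (unmapped ones become None); constants pass through.
        return mapping.get(x) if x in vars_a else x
    return sum((rename(s), r, rename(t)) in set_b for s, r, t in triples_a)

def neighbours(mapping, vars_a, vars_b):
    """Yield one-to-one mappings that differ from `mapping` by a single move."""
    used = set(mapping.values())
    for a in vars_a:                      # move: point a at an unused B variable
        for b in vars_b:
            if b not in used:
                cand = dict(mapping)
                cand[a] = b
                yield cand
    for i, a1 in enumerate(vars_a):       # move: swap the targets of two A variables
        for a2 in vars_a[i + 1:]:
            cand = dict(mapping)
            for a, b in ((a1, mapping.get(a2)), (a2, mapping.get(a1))):
                if b is None:
                    cand.pop(a, None)
                else:
                    cand[a] = b
            yield cand

def hill_climb(triples_a, vars_a, triples_b, vars_b, restarts=5, seed=0):
    """Greedy first-improvement search with random restarts (assumed settings)."""
    rng = random.Random(seed)
    vars_a, vars_b = list(vars_a), list(vars_b)
    best_map, best_score = {}, -1
    for _ in range(restarts):
        targets = vars_b[:]
        rng.shuffle(targets)
        mapping = dict(zip(vars_a, targets))          # random one-to-one start
        score = count_matches(triples_a, vars_a, triples_b, mapping)
        improved = True
        while improved:
            improved = False
            for cand in neighbours(mapping, vars_a, vars_b):
                s = count_matches(triples_a, vars_a, triples_b, cand)
                if s > score:
                    mapping, score, improved = cand, s, True
                    break
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score

# Two renamings of the same graph: (j / jump-01 :ARG0 (f / frog))
triples_a = [("j", "instance", "jump-01"), ("f", "instance", "frog"), ("j", "ARG0", "f")]
triples_b = [("x", "instance", "frog"), ("y", "instance", "jump-01"), ("y", "ARG0", "x")]
mapping, score = hill_climb(triples_a, ["j", "f"], triples_b, ["x", "y"])
print(mapping, score)  # {'j': 'y', 'f': 'x'} 3
```

Every candidate is rescored from scratch here, which is part of why a naive version gets slow on graphs with many variables. Caching per-move gains or compiling the inner loop are the obvious places to speed it up.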

Alas, these are just suggestions which may or may not be useful to have (in some near or distant future). Again, thanks for your awesome amrlib!

bjascob commented 4 years ago

I partially ported the JAMR aligner a while back and I can add this to the lib. I'd like to find the source for the aligner used for the LDC data but so far haven't been able to.

I'll dig into this a bit next week, as I'm OOO for a while.

lujiaying commented 4 years ago

The integrated aligner would be very helpful!

bjascob commented 4 years ago

Here's the status of things. Please comment if you have input on the direction of this feature...

I have a very simple rule-based "word aligner" from JAMR that I ported to python. It would need some updates / testing but it's reasonable to add this to amrlib. JAMR has several alignment methods and this is not the more complicated "span aligner" (i.e., phrases) that is the default alignment method it uses. That code is much more complicated (and in scala) so it would be a fair amount of work to port it into amrlib.

The ISI aligner, which I believe was used to annotate LDC2020T02, is a word aligner. It does not align spans the way JAMR's default method does. Since the current LDC corpus annotations are word-only alignments, I'm thinking that this is acceptable and that span alignments are not required.

There are better (model-based) aligners out there than the early (rule-based) JAMR one. I found the code for an ISI aligner (which may have been used to annotate LDC2020T02). That code is basically a bunch of scripts to train/run a model using the C++ MGIZA++ library. It would be a bit of a project, but not completely unreasonable, to make this usable either as part of amrlib, or more likely, as a stand-alone lib.

My thinking is that for now, the rule-based word aligner is the best place to start. As part of implementing it, I can try to get an F1 score for it against the ISI alignments so we can see if its performance is somewhat reasonable.
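
As a rough illustration of that comparison, the sketch below scores one set of alignments against another. It assumes each alignment is a (token index, node variable) pair and that the reference set plays the role of the ISI alignments. The function name `alignment_f1` and the data are made up for the example and are not part of amrlib.

```python
def alignment_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of alignment pairs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical alignments for "The frog jumps." (token index, node variable)
rule_based = {(0, "f"), (1, "f"), (2, "j")}   # one spurious pair: (0, "f")
reference = {(1, "f"), (2, "j")}
precision, recall, f1 = alignment_f1(rule_based, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 1.0 0.8
```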

lujiaying commented 4 years ago

The reason I want an aligner in amrlib is that JAMR is hard to install and somewhat hard to integrate with python scripts.

I have seen several papers using JAMR because it is considered a lightweight aligner compared to model-based ones. So I think as long as the aligner's performance is reasonable and it is easy to install, people like me would love to use it.

flipz357 commented 4 years ago

I agree with @lujiaying .

I also think that, from reading the AMR guidelines, there is no "theory of alignment". Therefore, the token-node alignment may even be a bit more clear-cut than node-span.

E.g.

# The frog jumps.
(j / jump-01
    :ARG0 (f / frog))

With token alignment it's clear that (f, frog) and (j, jumps) are the correct alignments. In span alignment, however, both (f, the frog) and (f, frog) can be considered correct.
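
For this example, a crude string-match token aligner could look like the sketch below. It is only an illustration of the idea and is not the JAMR aligner or amrlib's API. The function name `align_nodes_to_tokens`, the prefix rule standing in for lemmatization, and the dict of node concepts are all assumptions.

```python
import re

def align_nodes_to_tokens(sentence, nodes):
    """Align each AMR node to the first token that matches its concept string."""
    tokens = re.findall(r"\w+", sentence.lower())
    alignments = {}
    for var, concept in nodes.items():
        stem = re.sub(r"-\d+$", "", concept)   # drop sense suffix: jump-01 -> jump
        for i, tok in enumerate(tokens):
            # Exact match, or a prefix match as a rough stand-in for lemmatization.
            if tok == stem or (len(stem) >= 3 and tok.startswith(stem)):
                alignments[var] = i
                break
    return tokens, alignments

nodes = {"j": "jump-01", "f": "frog"}          # variable -> concept
tokens, alignments = align_nodes_to_tokens("The frog jumps.", nodes)
print(tokens)      # ['the', 'frog', 'jumps']
print(alignments)  # {'j': 2, 'f': 1}
```

Abstract concepts that have no surface form in the sentence simply stay unaligned under this rule, which is where additional rules or a span-based method would come in.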

On the other hand, maybe span alignment has advantages when concepts are really abstract.

In sum, I think that a lightweight node-token alignment that performs reasonably well would be a very good start, since it anchors the AMR in the sentence, which may already be quite helpful for some tasks.

bjascob commented 4 years ago

I have updated the master branch with a rule-based word aligner, similar in function to JAMR's word aligner. If you have comments or find bugs, please post a new issue. (An update to the pip install will follow.)