amrisi / amr-guidelines

AMR parsing expression grammar #202

Open mpertierra opened 7 years ago

mpertierra commented 7 years ago

Is there an official parsing expression grammar (PEG) for AMR? I tried writing one myself; the version below is the closest I've come, although I'm sure it's incomplete.

constant <- r'([^ \t\n()":]+)|("[^"]+")|(-?[0-9]+)|(-?[0-9]+\.[0-9]+)';
variable <- r'[^ \t\n()":]+';
concept <- r'([^ \t\n()":]+)|("[^"]+")';
relation <- r':[^ \t\n()":]+';
amr <- constant / ( '(' variable '/' concept (relation amr)* ')' );
root <- amr EOF;
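
For readers who want to try the grammar above without a PEG library, here is a minimal hand-written recursive-descent sketch in Python. It is an illustration, not an official parser: the names `tokenize`, `parse`, and `_parse_amr` are made up for this example, `\s` is used instead of the exact ` \t\n` class, and the `/` separator is assumed to be surrounded by whitespace.

```python
import re

# One token per match: a paren, a :relation, a quoted string, or a bare symbol.
# The character classes mirror the grammar above (assumption: \s for whitespace).
TOKEN = re.compile(r'\s*(\(|\)|:[^\s()":]+|"[^"]+"|[^\s()":]+)')


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _peek(tokens, i):
    return tokens[i] if i < len(tokens) else None


def _parse_amr(tokens, i):
    """amr <- constant / '(' variable '/' concept (relation amr)* ')'"""
    tok = _peek(tokens, i)
    if tok is None or tok == ')' or tok.startswith(':'):
        raise ValueError(f"expected a constant or '(' at token {i}")
    if tok != '(':
        return tok, i + 1  # constant: symbol, string, or number
    var, slash, concept = (_peek(tokens, i + k) for k in (1, 2, 3))
    if (var is None or var[0] in '():"' or slash != '/'
            or concept is None or concept[0] in '():'):
        raise ValueError(f"expected 'variable / concept' after token {i}")
    i += 4
    edges = []
    # (relation amr)*
    while (_peek(tokens, i) or '').startswith(':'):
        relation = tokens[i]
        child, i = _parse_amr(tokens, i + 1)
        edges.append((relation, child))
    if _peek(tokens, i) != ')':
        raise ValueError(f"expected ')' at token {i}")
    return (var, concept, edges), i + 1


def parse(text):
    """root <- amr EOF"""
    tokens = tokenize(text)
    node, i = _parse_amr(tokens, 0)
    if i != len(tokens):
        raise ValueError(f"trailing input at token {i}")
    return node
```

Calling `parse('(w / want-01 :ARG0 (b / boy))')` returns nested `(variable, concept, edges)` tuples; re-entrant variables such as a bare `b` come back as plain constants, because the grammar itself does not distinguish them.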

nschneid commented 7 years ago

An unofficial one which may not be perfect: https://github.com/nschneid/amr-hackathon/blob/master/src/amr.peg

mpertierra commented 7 years ago

@nschneid That looks great, thank you! Have there been significant changes to the AMR specs since you wrote that?

nschneid commented 7 years ago

Not significant changes, but there may be edge cases that show up in newer data. E.g., negative numbers (nschneid/amr-hackathon#2).
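
As a rough illustration of the negative-number edge case, a numeric-literal check with an optional leading minus sign could look like the sketch below. This is not the fix applied in the linked issue; `NUMBER` and `is_number` are hypothetical names used only here.

```python
import re

# Optional sign, integer part, optional fractional part (assumed number format).
NUMBER = re.compile(r'-?[0-9]+(?:\.[0-9]+)?')


def is_number(token):
    """Return True if the whole token is an integer or decimal literal."""
    return NUMBER.fullmatch(token) is not None
```

Using `fullmatch` matters here: a concept such as `want-01` contains digits and a hyphen but should not be treated as a number.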

mpertierra commented 7 years ago

@nschneid it seems that smatch is now being hosted on GitHub (https://github.com/snowblink14/smatch). The link to the repository is found here (http://amr.isi.edu/evaluation.html). It does not use a PEG to read in AMR annotations and instead it processes them by character. Is this the closest there is to an "official" AMR annotation parser?
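
To make "processes them by character" concrete, here is a small sketch of a character-by-character scanner. It is not smatch's code, only an illustration of the general approach; the function name `scan` is made up for this example.

```python
def scan(text):
    """Split an AMR string into tokens by walking it one character at a time."""
    tokens, buf = [], []
    in_quote = False

    def flush():
        # Emit whatever has accumulated as one token.
        if buf:
            tokens.append(''.join(buf))
            buf.clear()

    for ch in text:
        if in_quote:
            buf.append(ch)
            if ch == '"':          # closing quote ends the string token
                flush()
                in_quote = False
        elif ch == '"':            # opening quote starts a string token
            flush()
            buf.append(ch)
            in_quote = True
        elif ch in '()':           # parentheses are always their own token
            flush()
            tokens.append(ch)
        elif ch.isspace():         # whitespace separates tokens
            flush()
        else:
            buf.append(ch)
    if in_quote:
        raise ValueError("unterminated string literal")
    flush()
    return tokens
```

Tracking the in-quote state explicitly is what lets string values such as `"A_(b)"` keep their parentheses instead of being split into structural tokens.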

nschneid commented 7 years ago

I believe so. The smatch implementation has been tested more thoroughly than mine.

danielhers commented 7 years ago

Unfortunately the smatch version does not handle alignments, which @nschneid's version does. But there seems to be no established convention for them at all.

danielhers commented 7 years ago

What are the differences with respect to the grammar used for general PENMAN notation? https://github.com/goodmami/penman#penman-notation

goodmami commented 7 years ago

@danielhers, note that I'm not affiliated with ISI or the AMR group, so my module is not official in any way. I haven't done an actual comparison, but I intentionally made the grammar a bit more forgiving. E.g., I allow node identifiers (variables) to contain anything that isn't a space, a / or a ), and for strings I allow anything inside the quotes, even escaped quotes. I don't (yet) do character alignments, because, as you say, there is no established format I've seen outside of @nschneid's implementation. In the default configuration, I also allow empty nodes (e.g. (a)) and anonymous relations ((a : b)). I don't believe these are valid in AMR, or possibly even in the original PENMAN SPL. I hope to offer an AMR-specific subclass of my PENMANCodec class that restricts the grammar to what is allowed in AMR (and also to manage inversions like :domain <-> :mod), but I haven't had a chance to do this yet (contributions are welcome; and thanks for your previous PR).
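
The two lenient rules and the inversion idea described above can be sketched with the standard library alone. This is not the penman module's code or API: `VARIABLE`, `STRING`, `NON_INVERSE_OF`, and `invert` are hypothetical names, and the `-of` handling is an assumption about how AMR role inversion is usually written.

```python
import re

# Lenient variable: anything that is not whitespace, '/' or ')'.
VARIABLE = re.compile(r'[^\s/)]+')

# Lenient string: anything between quotes, allowing backslash-escaped characters.
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

# :domain and :mod are inverses of each other rather than using an -of suffix.
SPECIAL_INVERSES = {':domain': ':mod', ':mod': ':domain'}

# Assumption: roles that end in -of without being inverted roles.
NON_INVERSE_OF = {':consist-of'}


def invert(source, role, target):
    """Return the same edge seen from the other end, as (source, role, target)."""
    if role in SPECIAL_INVERSES:
        new_role = SPECIAL_INVERSES[role]
    elif role.endswith('-of') and role not in NON_INVERSE_OF:
        new_role = role[:-3]
    else:
        new_role = role + '-of'
    return (target, new_role, source)
```

For example, `invert('b', ':mod', 's')` gives `('s', ':domain', 'b')`, and inverting an edge twice returns the original triple.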

Also, on the README I defined the PEG grammar that the module abides by, but I don't use a PEG parsing algorithm (e.g. packrat or even recursive descent). But decoding is pretty well tested: https://github.com/goodmami/penman/blob/master/tests/test_penman.py#L44-L139

danielhers commented 7 years ago

Thanks @goodmami, your module is very useful to me as it allows dynamically creating AMRs easily and printing them (which @nschneid's doesn't really yet). It might not be official but it does the job :)