bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.

Incorrect PENMAN with multi-word expression #27

Closed · jheinecke closed this issue 2 years ago

jheinecke commented 2 years ago

Hi, thanks for this great tool! I am using it with the T5 parser model, and I stumbled across an error for a sentence with an abbreviation: Using SEO to Inform Your Website Content Strategy. Somewhere during decoding it looks like SEO is replaced with search engine (optimization), so the created PENMAN graph is syntactically incorrect. The graph below was taken from the parse_sents() method in amrlib/amrlib/models/parse_t5/inference.py, just before the call to gstring = PenmanDeSerializer(g).get_graph_string() (variable g in line 70), at a point where the instances are not yet given.

https://github.com/bjascob/amrlib/blob/7ddb4dd59f463ffc6e9d659d01b3b36b98d71afe/amrlib/models/parse_t5/inference.py#L69-L71

( use-01
    :ARG1 ( search engine    # <--- invalid space here
    :name ( name 
        :op1 "SEO" ) )
   :ARG2 ( inform-01 
    :ARG0 use-01
    :ARG1 ( strategy
         :topic ( content 
            :mod ( website ) )
         :poss ( you ) ) ) )

I guess this can happen with seq2seq models, but is there anything that can be done to avoid spaces in concept names?

bjascob commented 2 years ago

Unfortunately, it's the T5 model itself that has learned to do this. It must come from T5's original training data, because it's not in the AMR dataset. This means there isn't any simple, general way to fix it without some level of re-training of the model.

In general, AMR parsing with seq-to-seq models is very messy, as they often don't produce valid graphs. The default method is to produce 4 graphs and choose the first (highest-scoring) one that decodes properly. For this particular sentence, all 4 fail to deserialize. You can increase the number of candidates the model produces by raising the beam size when you load the model with stog = amrlib.load_stog_model(num_beams=8). Doing this, you see it fails on the first 4 beams but finally finds a good one and returns the result...

>>> stog = amrlib.load_stog_model(num_beams=8)
>>> graphs = stog.parse_sents(['Using SEO to Inform Your Website Content Strategy.'])
>>> print(graphs[0])
# ::snt Using SEO to Inform Your Website Content Strategy.
(u / use-01
      :ARG1 (s / SEO)
      :ARG2 (ii / inform-01
            :ARG1 (s2 / strategy
                  :topic (c / content
                        :mod (w / website))
                  :poss (y / you)))
      :ARG0-of ii)

It also appears to fix the issue if you quote "SEO":

>>> stog = amrlib.load_stog_model()
>>> graphs = stog.parse_sents(['Using "SEO" to Inform Your Website Content Strategy.'])
>>> print(graphs[0])
# ::snt Using "SEO" to Inform Your Website Content Strategy.
(u / use-01
      :ARG1 (s / search-engine
            :name (n / name
                  :op1 "SEO"))
      :ARG2 (ii / inform-01
            :ARG1 (s2 / strategy
                  :topic (c / content
                        :mod (w / website))
                  :poss (y / you)))
      :ARG0-of ii)

Probably, the easiest thing to do on failures is to re-run them with a higher beam size.
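
For reference, a retry loop along those lines might look like the sketch below. It assumes parse_sents() returns None for sentences whose graphs fail to deserialize (per the behavior described above), and simply reloads the model with a larger num_beams for just the failures:

import amrlib

sents = ['Using SEO to Inform Your Website Content Strategy.']

# First pass with the default beam size
stog = amrlib.load_stog_model()
graphs = stog.parse_sents(sents)

# Re-run only the failures (None entries) with a wider beam
failed = [i for i, g in enumerate(graphs) if g is None]
if failed:
    stog = amrlib.load_stog_model(num_beams=8)
    for i, g in zip(failed, stog.parse_sents([sents[i] for i in failed])):
        graphs[i] = g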

jheinecke commented 2 years ago

Thanks, I didn't see this parameter; I'll try it. I also had the idea that if the PenmanDeSerializer (or something else) at line 70 were capable of repairing "simple" formal errors (like the spaces in concepts, or missing final parentheses), we could get even better analyses. Do you know how often the best-scored solution cannot be decoded and is therefore skipped?

bjascob commented 2 years ago

Take a look at PenmanDeSerializer. The code there tries to reconstruct a properly formed graph even when there are errors in the sequence output. I think I made the assumption that the output is space-tokenized, and I don't allow for multi-token node names (I don't think AMR allows multi-word node names either). This could probably be handled by adding some logic, but you'll need to spend some time understanding how the deserializer works, as it's already a bit complicated.
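
Purely to illustrate the kind of logic that could be added (this is not part of amrlib, and join_multiword_concepts is a hypothetical helper), one could hyphenate multi-word concepts in the sequence string before deserialization, in line with the search-engine naming seen in the good parse above:

import re

def join_multiword_concepts(gstring):
    # Join word runs that directly follow '(' and run up to ':', '(' or ')',
    # e.g. '( search engine :name' -> '( search-engine :name'. This targets
    # only the specific failure above; it is not a general graph repair.
    def _join(m):
        return '( ' + '-'.join(m.group(1).split())
    return re.sub(r'\(\s*([A-Za-z][\w-]*(?:\s+[A-Za-z][\w-]*)+)(?=\s*[:()])',
                  _join, gstring)

print(join_multiword_concepts('( use-01 :ARG1 ( search engine :name ( name :op1 "SEO" ) ) )'))
# -> ( use-01 :ARG1 ( search-engine :name ( name :op1 "SEO" ) ) )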

Getting the deserializer to work properly was a bit of a chore, but the current code seems to work pretty well from a metrics standpoint. For the current T5 model with PenmanDeSerializer, out of the 1898 AMR test graphs, only 1 failed to deserialize properly with a beam size of 1, and all produced results when the beam size was set to 4.

jheinecke commented 2 years ago

OK, I'll have a look into that. Another question: for testing (and when using amrlib), are sentences POS-tagged before being sent into the model to predict the AMR graph, or is spaCy used only for training? As far as I understand amrlib/amrlib/models/parse_t5/inference.py, you send the raw text through the tokenizer into the model; have I understood that correctly?

bjascob commented 2 years ago

SpaCy's POS tagging, tokenization, lemmatization, etc. are only used for the parse_gsii model. The parse_t5 model has its own tokenizer, built into the Hugging Face transformers library code that it's part of. It doesn't require POS tagging or any other type of preprocessing from SpaCy.
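
To make that concrete (a sketch, not amrlib's actual code; the stock t5-base tokenizer stands in here for the one bundled with the parse model):

from transformers import T5Tokenizer

# The raw, untokenized sentence goes straight into the model's subword
# tokenizer; there is no POS tagging or SpaCy step in between.
tok = T5Tokenizer.from_pretrained('t5-base')
enc = tok('Using SEO to Inform Your Website Content Strategy.', return_tensors='pt')
print(enc.input_ids)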