bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.
MIT License

How to get alignment between english sentence word and AMR node ? #40

Closed code-010 closed 2 years ago

code-010 commented 2 years ago

First of all, I am sorry if this is a silly question. I am new to AMR and am doing a college project on it. Using amrlib, I can parse an English sentence and get its AMR textual representation.

Here is the code that I used to parse an english sentence

import spacy
import amrlib
amrlib.setup_spacy_extension()
nlp = spacy.load('en_core_web_sm')
doc = nlp('What did the girl find ?')
graphs = doc._.to_amr()

for graph in graphs:
    print(graph)

and got the below output

 # ::snt What did the girl find ?
 # ::tokens ["What", "did", "the", "girl", "find", "?"]
 # ::ner_tags ["O", "O", "O", "O", "O", "O"]
 # ::ner_iob ["O", "O", "O", "O", "O", "O"]
 # ::pos_tags ["WP", "VBD", "DT", "NN", "VB", "."]
 # ::lemmas ["what", "do", "the", "girl", "find", "?"]
(f0 / find-01
      :ARG0 (g0 / girl)
      :ARG1 (a0 / amr-unknown))

But my question is: how can I get the alignment between the AMR nodes and the words of the input English sentence, i.e. something like # ::alignments 4-5|0 3-4|0.0 0-1|0.1

bjascob commented 2 years ago

The old parsers (e.g. JAMR) used to do alignments as part of their parsing process, but the newer transformer-based parsers don't need to. However, amrlib does provide this functionality as a post-processing operation. See FAA Aligner.

Make sure you note that the input sents need to be space-tokenized strings, as the numbering of words in the original text is based on space tokenization. Also note that the alignment_string format is ISI (the same as the AMR-3 corpus's alignments), not JAMR.

If you want to add the alignment string back into the metadata, you need to do that manually, either via string manipulation or by using the penman library to manipulate the graph and add the string to the metadata under the key alignments, i.e.:

import penman
pgraph = penman.decode(graph)  # parse the AMR string into a penman Graph
pgraph.metadata['alignments'] = alignment_string  # stored under the 'alignments' key
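
The "string manipulation" route mentioned above could look roughly like the sketch below. It uses only the standard library and assumes metadata lines start with `#` and come before the graph body, as in the parser output earlier in this thread. The helper name `add_alignments_metadata` is hypothetical (not part of amrlib or penman), and the alignment string is an illustrative placeholder in ISI style, not real aligner output.

```python
def add_alignments_metadata(graph_string, alignment_string):
    """Insert a '# ::alignments ...' line after the existing metadata lines.

    Hypothetical helper for illustration; not an amrlib or penman function.
    """
    lines = graph_string.splitlines()
    # Index of the first line that is not a '#' metadata comment
    # (falls back to the end if every line is a comment).
    first_graph_line = next(
        (i for i, line in enumerate(lines) if not line.lstrip().startswith('#')),
        len(lines),
    )
    lines.insert(first_graph_line, '# ::alignments ' + alignment_string)
    return '\n'.join(lines)


graph = (
    "# ::snt What did the girl find ?\n"
    "(f0 / find-01\n"
    "      :ARG0 (g0 / girl)\n"
    "      :ARG1 (a0 / amr-unknown))"
)
# Placeholder value, made up for this example; use the aligner's real output.
alignment_string = "4-1 3-1.1 0-1.2"
annotated = add_alignments_metadata(graph, alignment_string)
print(annotated)
```

The penman approach is probably the more robust choice when the graphs are being modified in other ways too; the string version is only meant for the simple case of adding one metadata line.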

code-010 commented 2 years ago

Thank you

code-010 commented 2 years ago

I ran into an error while running FAA Aligner, i.e. assert len(space_tok_sents) == len(graph_strings) AssertionError

Since you gave the documentation link for FAA Aligner, I used the code provided in the documentation:

from amrlib.alignments.faa_aligner import FAA_Aligner
inference = FAA_Aligner()
amr_surface_aligns, alignment_strings = inference.align_sents(sents, graph_strings)
print(alignment_strings)

I provided a space-tokenized sentence, i.e. sent = ["What", "did", "the", "girl", "find", "?"], and graph_strings = "(f0 / find-01\n\t:ARG0 (g0 / girl)\n\t:ARG1 (a0 / amr-unknown))"

But I am getting error as below


Traceback (most recent call last):
  File "faa.py", line 4, in <module>
    amr_surface_aligns, alignment_strings = inference.align_sents(["What", "did", "the", "girl", "find", "?"], graph_str1)
  File "/usr/local/lib/python3.8/dist-packages/amrlib/alignments/faa_aligner/faa_aligner.py", line 37, in align_sents
    assert len(space_tok_sents) == len(graph_strings)
AssertionError

Am I doing something wrong? If yes, can you give an example of input sents and graph_strings?

bjascob commented 2 years ago

You're giving it a list of tokens where it's expecting a string with spaces between the tokens. Do something like sent = ' '.join(sent). It's also supposed to take a list of sentences and a list of graphs. If you only have one of each, just put them inside brackets, i.e. [sent], so they form a list.
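
Putting both fixes together, the input preparation might be sketched as follows. The helper names `prepare_faa_inputs` and `align_one` are hypothetical and used only for illustration; the `FAA_Aligner` import and the `align_sents` call are taken from the documentation snippet quoted earlier in this thread. `align_one` is defined but not called here, since it needs amrlib and a working aligner setup.

```python
def prepare_faa_inputs(tokens, graph_string):
    """Build the two parallel lists that align_sents expects.

    Hypothetical helper: joins the tokens into one space-tokenized string
    and wraps the single sentence and single graph in lists.
    """
    sent = ' '.join(tokens)
    return [sent], [graph_string]


def align_one(tokens, graph_string):
    """Align a single sentence/graph pair (requires amrlib to be installed)."""
    from amrlib.alignments.faa_aligner import FAA_Aligner
    sents, graph_strings = prepare_faa_inputs(tokens, graph_string)
    inference = FAA_Aligner()
    amr_surface_aligns, alignment_strings = inference.align_sents(sents, graph_strings)
    return alignment_strings


tokens = ["What", "did", "the", "girl", "find", "?"]
graph_string = "(f0 / find-01\n\t:ARG0 (g0 / girl)\n\t:ARG1 (a0 / amr-unknown))"

sents, graph_strings = prepare_faa_inputs(tokens, graph_string)
print(sents)               # ['What did the girl find ?']
print(len(graph_strings))  # 1
```

With both arguments passed as lists of equal length, the `len(space_tok_sents) == len(graph_strings)` assertion from the traceback above should no longer fail for this reason.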