coli-saar / am-parser

Modular implementation of an AM dependency parser in AllenNLP.
Apache License 2.0

Get NEW AMR pipeline to work again #7

Closed alexanderkoller closed 5 years ago

alexanderkoller commented 5 years ago

AMR has a much more complex pipeline than all the other graphbanks. Make sure we understand how to run it, and document it in the am-parser wiki.

Once we have converted the MRP training set (see #2, #3), train the parser on AMR, document how to do it on the am-parser Wiki, and record the results in a Google Sheet.

@megodoonch I'm adding you to this issue because we'll probably need your help.

luciaelizabeth commented 5 years ago

Update from today's meeting: @weissenh and @megodoonch will work together to get the pipeline working. This includes:

  1. Mtool conversion (@weissenh)
  2. Preprocessing (@megodoonch; revise documentation when possible)
  3. Run (@namednil)
  4. Postprocessing

Tentative goal: End of week (July 5)

We would also like to visualise AMR graphs to make sure that the edges go in the correct directions (especially for :x-of arguments).

At a later date, @namednil will look into optimization of the pipeline: e.g. remedying node labels that become dashes.

weissenh commented 5 years ago

About (1), adding a write-to-AMR option to mtool: I looked at how mtool reads in these :x-of edges (from both MRP and AMR input), and it seems to me that the original x-of string is stored in the Edge.lab attribute, whereas the 'normalized' version (e.g. x) is stored in Edge.normal. So when writing such a graph object to AMR, I can just use the lab field of the Edge class. Btw, I looked at an example with a :mod edge, which was represented in the Graph object by an Edge object where Edge.lab = "mod" and Edge.normal = "domain". So assuming that we start with a graph input that already contains ARG1-of edges where necessary, we don't have to do anything special with these edges. For instance, the sample data for AMR (see mtool/data/sample/amr) already contains such x-of edges.
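
A rough illustration of the idea, not mtool's actual classes: the `Edge` dataclass below is a hypothetical stand-in that only carries the two fields discussed above (`lab` and `normal`), and `amr_label` is a made-up helper name. Treating `normal` as `None` for edges that were never normalized is also an assumption. The point is only that the writer can ignore `normal` and print `lab`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    """Hypothetical stand-in for mtool's Edge, reduced to the fields discussed here."""
    src: str
    tgt: str
    lab: str                      # label as read from the input, e.g. "ARG1-of"
    normal: Optional[str] = None  # normalized label, e.g. "ARG1" (assumed None if unchanged)


def amr_label(edge: Edge) -> str:
    """Label to print in AMR notation: always the original `lab`, never `normal`."""
    return ":" + edge.lab


edges = [
    Edge("y", "i", "ARG1-of", normal="ARG1"),
    Edge("a", "p", "mod", normal="domain"),
    Edge("p", "y", "ARG2"),
]
labels = [amr_label(e) for e in edges]
print(labels)
```

With this approach the inverse edges come out exactly as they went in, so no special handling is needed when writing.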

megodoonch commented 5 years ago

Pia and I met this afternoon. We got the preprocessor running on a mini-corpus, but there are a bunch of warnings that might be a problem. I'm trying to interpret them now.

megodoonch commented 5 years ago

Looks like the errors are just warnings, so Pia has gone ahead with preprocessing the new corpus.

However, probably due to the large training corpus, it's taking longer than expected. It's been running for half a day now, and it's not done.

megodoonch commented 5 years ago

Here is the Wordnet error we keep getting. Alexander looked into the Wordnet code, and it arises when an empty string is passed as an argument to Wordnet.

*** WARNING *** java.lang.IllegalArgumentException
        at edu.mit.jwi.morph.SimpleStemmer.normalize(SimpleStemmer.java:190)
        at edu.mit.jwi.morph.WordnetStemmer.findStems(WordnetStemmer.java:72)
        at edu.mit.jwi.morph.SimpleStemmer.getVerbCollocationRoots(SimpleStemmer.java:389)
        at edu.mit.jwi.morph.SimpleStemmer.findStems(SimpleStemmer.java:160)
        at edu.mit.jwi.morph.WordnetStemmer.findStems(WordnetStemmer.java:95)
        at de.saar.coli.amrtools.aligner.WordnetEnumerator.getWNCandidates(WordnetEnumerator.java:102)
        at de.saar.coli.amrtools.aligner.CandidateMatcher.findCandidatesForProb(CandidateMatcher.java:63)
        at de.saar.coli.amrtools.aligner.Aligner.probabilityAlign(Aligner.java:210)
        at de.saar.coli.amrtools.aligner.Aligner.main(Aligner.java:167)
        at de.saar.coli.amrtools.datascript.RawAMRCorpus2TrainingData.main(RawAMRCorpus2TrainingData.java:119)

alexanderkoller commented 5 years ago

More precisely, when the argument consists only of whitespace.

megodoonch commented 5 years ago

The preprocessing is done!

megodoonch commented 5 years ago

Is this a warning we need to worry about?

***WARNING*** found too many nodes with incoming edges in alignment! [h, y, h2]

megodoonch commented 5 years ago

Is this a problem?

Successes for absolute alignments: 14500/14499
Successes for probabilistic alignments: 14500/14499
***probably infinite loop! breaking out of it.
{null={0}}
{2, 6, 4, 15, 14, 12, 13, 8, 9, 11, 10, 1, 3, 7, 5, 16, 17, 19, 18, 21, 20}
[null!||0-1||1.0]
Failed: 14933 (size 1)
***probably infinite loop! breaking out of it.
{null={0}}
{2, 6, 4, 15, 14, 12, 13, 8, 9, 11, 10, 1, 3, 7, 5, 16, 17, 19, 18, 21, 20}
[null!||0-1||1.0]
Successes for absolute alignments: 14999/14999
Successes for probabilistic alignments: 14999/14999

megodoonch commented 5 years ago

It looks like the dev set aligner didn't run due to a null pointer exception.

Running aligner (basic)
Exception in thread "main" java.lang.NullPointerException
        at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
        at java.util.regex.Matcher.reset(Matcher.java:309)
        at java.util.regex.Matcher.<init>(Matcher.java:229)
        at java.util.regex.Pattern.matcher(Pattern.java:1093)
        at de.up.ling.irtg.corpus.Corpus.readCorpusWrapper(Corpus.java:258)
        at de.up.ling.irtg.corpus.Corpus.readCorpus(Corpus.java:209)
        at de.saar.coli.amrtools.aligner.Aligner.main(Aligner.java:143)
        at de.saar.coli.amrtools.datascript.RawAMRCorpus2TrainingData.main(RawAMRCorpus2TrainingData.java:119)

While running the command

java -Xmx600G -cp alto-2.3-SNAPSHOT-jar-with-dependencies.jar de.saar.coli.amrtools.datascript.RawAMRCorpus2TrainingData -i amr-prepro/corpus/dev/ -o amr-prepro/data/alto/dev/ -g data/englishPCFG.txt --corefSplit -t 50 --minutes 20 -w data/wordnet/dict/ -pos data/english-bidirectional-distsim.tagger >>amr-prepro/preprocessLog 2>&1

megodoonch commented 5 years ago

One graph threw this:

423
("""a<root> " " " / " " " and  :op1 (s / string-entity  :value (explicitanon0 / charity))  :op2 (s1 / string-entity  :value (explicitanon1 / "/_UNK_CHAR_t_UNK_CHAR__UNK_CHAR_r_UNK_CHAR_ti/"))  :op3 (s2 / show-01  :ARG1 (s3 / string-entity  :value (explicitanon2 / [char-i-tee])  :ARG2-of (s4 / spell-01)))  :op4 (s5 / show-01  :ARG1 (a1 / alphabet  :mod (p / phonetic)  :mod (i / international)))  :op5 (n / noun)  :op6 (h / have-mod-91  :ARG1 (p1 / plural)  :ARG2 (s6 / string-entity  :value (explicitanon3 / -ties)))""")
java.lang.NullPointerException
        at org.jgrapht.graph.AbstractGraph.assertVertexExist(Unknown Source)
        at org.jgrapht.graph.AbstractBaseGraph$DirectedSpecifics.getEdgeContainer(Unknown Source)
        at org.jgrapht.graph.AbstractBaseGraph$DirectedSpecifics.edgesOf(Unknown Source)
        at org.jgrapht.graph.AbstractBaseGraph.edgesOf(Unknown Source)
        at de.up.ling.irtg.algebra.graph.AMSignatureBuilder.getConstantsForAlignment(AMSignatureBuilder.java:454)
        at de.saar.coli.amrtagging.AlignmentTrackingAutomaton.create(AlignmentTrackingAutomaton.java:269)
        at de.saar.coli.amrtagging.DependencyExtractorCLI.lambda$main$3(DependencyExtractorCLI.java:167)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

megodoonch commented 5 years ago

The main worry is that there might be too many graph constants that want to have multiple roots. There are two variants of this error, it seems. Here's one:

***WARNING*** more than one edge at node null
[ARG0, ARG1]

megodoonch commented 5 years ago

The other type is this:

3904
(p<root> / purpose-01  :ARG2 (y / you  :ARG1-of (i / inform-01  :ARG1-of p)))
java.lang.IllegalArgumentException: Cannot create a constant for this alignment (p!|i||0-1||1.0): More than one node with edges to outside.
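
To make the message concrete, here is a small sketch of the condition it appears to describe. This is only a reading of the error text, not the actual check in am-tools; the function name `nodes_with_outside_edges` and the edge-triple representation are hypothetical. The edges are those of graph 3904 above with the inverse `-of` edges turned around, and the aligned node set is {p, i}.

```python
from typing import List, Set, Tuple

LabeledEdge = Tuple[str, str, str]  # (source, label, target)


def nodes_with_outside_edges(aligned: Set[str], edges: List[LabeledEdge]) -> Set[str]:
    """Return the aligned nodes that have at least one edge to or from an unaligned node."""
    boundary = set()
    for src, _label, tgt in edges:
        # An edge crosses the boundary if exactly one endpoint is in the aligned set.
        if (src in aligned) != (tgt in aligned):
            boundary.add(src if src in aligned else tgt)
    return boundary


# Graph 3904, with ":ARG1-of" edges reversed into plain ":ARG1" edges.
edges = [
    ("p", "ARG2", "y"),  # purpose-01 :ARG2 you
    ("i", "ARG1", "y"),  # you :ARG1-of inform-01
    ("p", "ARG1", "i"),  # inform-01 :ARG1-of purpose-01
]
boundary = nodes_with_outside_edges({"p", "i"}, edges)
print(sorted(boundary))
```

Both p and i connect to y, which lies outside the alignment, so under this reading the constant would need two attachment points.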

megodoonch commented 5 years ago

I have to look into the code to make sure this is what is being reported, but during decomposition we occasionally get reports like this:

Successes: 19629/19999

If this is indeed the proportion of decomposable graphs, then we're really doing fine. The last such number printed is

Successes: 51579/55999
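
A quick way to turn these counters into a rate, sketched under the assumption that the log contains lines of exactly the form `Successes: X/Y` as shown above. The helper name `last_success_count` is made up for this example, and the unrelated text in the sample log is invented.

```python
import re
from typing import Optional, Tuple

# Matches only the plain decomposition counter, not the
# "Successes for absolute alignments: ..." lines from the aligner.
SUCCESS_RE = re.compile(r"^Successes: (\d+)/(\d+)$")


def last_success_count(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (successes, attempted) from the last matching line, or None if there is none."""
    last = None
    for line in log_text.splitlines():
        match = SUCCESS_RE.match(line.strip())
        if match:
            last = (int(match.group(1)), int(match.group(2)))
    return last


log = "Successes: 19629/19999\nsome unrelated output\nSuccesses: 51579/55999\n"
done, total = last_success_count(log)
rate = done / total
print(f"{done}/{total} = {rate:.1%} decomposed, {1 - rate:.1%} failed")
```

For the last counter reported above, this comes to roughly 92.1% decomposed and 7.9% failed, provided the numbers really are decomposition successes.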

megodoonch commented 5 years ago

Pia has looked into this, but we're not sure what it is.

20322
(a<root> / and  :op1 (p / possible-01  :manner (a1 / amr-unknown)  :ARG1 (t / take-01  :ARG1 (h / honor-01  :ARG1-of (t1 / take-01  :ARG0 (a2 / amr-unknown)  :condition (t2 / take-01  :ARG1 h)  :op2-of a)))))
java.lang.UnsupportedOperationException: Graph cannot be represented as AMR: unvisited nodes.
        at de.up.ling.irtg.codec.SgraphAmrOutputCodec.write(SgraphAmrOutputCodec.java:63)
        at de.up.ling.irtg.codec.SgraphAmrOutputCodec.write(SgraphAmrOutputCodec.java:40)
        at de.up.ling.irtg.codec.OutputCodec.asString(OutputCodec.java:98)
        at de.up.ling.irtg.algebra.graph.SGraph.toIsiAmrStringWithSources(SGraph.java:626)
        at de.up.ling.irtg.algebra.graph.AMSignatureBuilder.getConstantsForAlignment(AMSignatureBuilder.java:693)
        at de.saar.coli.amrtagging.AlignmentTrackingAutomaton.create(AlignmentTrackingAutomaton.java:269)
        at de.saar.coli.amrtagging.DependencyExtractorCLI.lambda$main$3(DependencyExtractorCLI.java:167)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

weissenh commented 5 years ago

Note to myself: the training data section of the MRP website says that the AMR corpus consists of 57,885 sentences. However, in /proj/irtg/sempardata/mrp/LDC2019E45/2019/training/amr I can only find 56,240 graphs. Is the number of sentences != the number of graphs? Are we using a different training corpus? It's not due to my mrp2amr conversion: I also checked the mrp files directly, and the number is the same, 56,240.
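
For reference, a check like this can be done along the following lines. This is a sketch that assumes the MRP files are JSON Lines (one graph object per line) with a `.mrp` extension; `count_mrp_graphs` is a hypothetical helper, and the demo runs on a throwaway temporary directory, not on the real corpus.

```python
import json
import tempfile
from pathlib import Path


def count_mrp_graphs(directory: str) -> int:
    """Count graphs in all *.mrp files of a directory, assuming one JSON object per line."""
    total = 0
    for path in sorted(Path(directory).glob("*.mrp")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():      # skip blank lines
                    json.loads(line)  # raises if the line is not valid JSON
                    total += 1
    return total


# Demo on a temporary directory with two tiny, made-up graphs.
with tempfile.TemporaryDirectory() as tmp:
    sample = '{"id": "1", "nodes": []}\n{"id": "2", "nodes": []}\n\n'
    Path(tmp, "sample.mrp").write_text(sample, encoding="utf-8")
    demo_count = count_mrp_graphs(tmp)
print(demo_count)
```

Parsing each line rather than just counting newlines also catches truncated or malformed entries.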

alexanderkoller commented 5 years ago

Good catch. Maybe ask the organizers?

luciaelizabeth commented 5 years ago

Yes, thanks! I will write now.

megodoonch commented 5 years ago

The number of undecomposable graphs is about 10%, which is the expected number. I think Pia and I can say that the preprocessing pipeline is running fine. (We just need to change it now to make it compatible with the task guidelines.)

luciaelizabeth commented 5 years ago

@weissenh: response from organizers about amr graphs

"thanks for double-checking those counts! there is indeed one graph per sentence, and our counts on the MRP web pages were just wrong. 56,240 is the right number!"

very nice catch! good to go ahead.

alexanderkoller commented 5 years ago

Can this issue be closed?

luciaelizabeth commented 5 years ago

Another response from Tim O'Gorman: Thanks for pointing this out! To give some background -- AMR has an official "development" split, and the 57,885 number includes that dev set. Since the other formats didn't have development sets, we ended up only releasing training data for each (plus "wsj.mrp", which is in that dev set), leading to the split. We ended up only releasing the actual training data as training data -- I think it's arguable that we should have included that dev data, but it's far too late to include it, and it shouldn't affect things that much.

megodoonch commented 5 years ago

HALLELUJAH, it works on the mini-corpus. Java changes pushed to the am-tools GitHub repo. Now to find a place for the bash scripts.