Hi Stephen, thanks for reaching out! First of all, did you manage to reproduce the results with our pretrained model (https://github.com/coli-saar/am-parser#reproducing-our-experiment-results)?
In order to hunt down the problem with the retraining, we need to know a little more. In particular:

- Which versions of am-parser and am-tools did you use?
- Did you use the timeout option of predict.sh (i.e., did you pass the -f flag)?

We suspect that we created a slight backwards incompatibility when we worked on am-tools in preparation for the MRP shared task -- we haven't rerun the entire training pipeline on AMR-2017 since.
In general, it would also be useful if you sent us your test.amconll file that the parser created and the (final) predicted AMR graphs.
I'll send the amconll and amr files in an email since they contain LDC data.
Versions used: The results I reported were produced using recent versions of am-parser and am-tools.
Timeout option: I used predict.sh and did not use the -f flag
I just tried cloning am-parser again and using the pretrained model. I get the following results:
Precision: 0.7686 Recall: 0.7122 F-score: 0.7393
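(For reference, the F-score here is just the harmonic mean of the precision and recall, which is what Smatch reports, so the numbers above can be sanity-checked quickly:)

```python
# Sanity check: Smatch's F-score is the harmonic mean of precision and recall.
# Numbers taken from the run above.
p, r = 0.7686, 0.7122
f = 2 * p * r / (p + r)
print(round(f, 4))  # 0.7393
```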
Thanks! I spotted some problems and was able to reproduce the situation. Indeed, by modifying the am-tools code for the shared task, we created some incompatibilities.
Within the tratz branch of am-tools, these issues should be solved. I now get
Precision: 0.7767
Recall: 0.7342
F-score: 0.7549
with our pretrained model on the test set; one sentence was skipped. (In the next few days, we'll also upload a slight correction to the model files that could raise the F-score to about 75.8, by applying a bugfix that we made during the shared task.)
Let me know if the changes in the branch of am-tools don't work as expected.
Next, I'll check that these fixes also work as expected when preparing the training data, so that retraining the model gives a performance in the expected range.
Again and again, I am amazed at what an impact the preprocessing has on the Smatch score.
Just a quick update on this. I reran the entire pipeline (with the tratz branch) and got the following results on the test set:
Precision: 0.7816
Recall: 0.7377
F-score: 0.7590
That includes the bugfix I mentioned above:

> (In the next few days, we'll also upload a slight correction to the model files that could raise the F-score to about 75.8, by applying a bugfix that we made during the shared task.)
If you want, you can try again now, but I'm also planning on making the pipeline easier to run so that you don't have to move so many files around by hand.
Quick update on this: I am now convinced that the changes didn't introduce new issues in the processing of the MRP pipeline, so I merged the branch.
If you download a current version of am-tools, compile it, and replace the automatically downloaded version, predict.sh should give you Smatch scores in the right range.
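If you want to double-check the Smatch score independently of the pipeline's own evaluation, a rough sketch along the following lines should work with the standalone smatch package (pip install smatch). The helper names (get_amr_line, get_amr_match, compute_f) and the placeholder file names are assumptions on my part, so verify them against the smatch version you have installed:

```python
# Rough sketch: corpus-level Smatch between a predicted and a gold AMR file,
# using the standalone "smatch" package rather than am-parser's evaluation.
# Helper names below are assumed from smatch's module-level API; verify them
# against your installed version.
import smatch

def corpus_smatch(pred_path, gold_path):
    total_match = total_test = total_gold = 0
    with open(pred_path) as pred_f, open(gold_path) as gold_f:
        while True:
            pred_amr = smatch.get_amr_line(pred_f)  # one blank-line-separated AMR
            gold_amr = smatch.get_amr_line(gold_f)
            if pred_amr == "" or gold_amr == "":
                break
            match, test, gold = smatch.get_amr_match(pred_amr, gold_amr)
            total_match += match
            total_test += test
            total_gold += gold
            smatch.match_triple_dict.clear()  # reset smatch's per-sentence cache
    # compute_f returns (precision, recall, f_score)
    return smatch.compute_f(total_match, total_test, total_gold)

# Placeholder file names -- point these at your predicted and gold AMR files.
print(corpus_smatch("parser_output.amr", "gold_test.amr"))
```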
The simplification of the pipeline and better documentation, which will be useful if you want to train from scratch, are not complete yet, but I'm on it.
I cleaned up the scripts and added a small example for how to decompose an AMR corpus. The documentation for AMR should be better now. Please have a look at https://github.com/coli-saar/am-parser/wiki/AMR-Preprocessing.
I'm closing the issue for now. Feel free to re-open it if you get unexpected results.
Hello,
I am trying to reproduce the results reported in your ACL 2019 paper for AMR 2017. The documentation isn't the clearest, but I was eventually able to retrain the system.
Here are my current results.
Precision: 0.7542 Recall: 0.7173 F-score: 0.7353
The F-score is lower than the 75.3 reported in the paper. Any ideas of where to start to fix this?
Thanks!
--Stephen