Hi Stephen, thanks for reaching out! First of all, did you manage to reproduce the results with our pretrained model (https://github.com/coli-saar/am-parser#reproducing-our-experiment-results)?
In order to hunt down the problem with the retraining, we need to know a little more. In particular:

- Which versions of am-parser and am-tools did you use?
- Did you use the timeout option of predict.sh (i.e., did you pass the -f flag)?

We suspect that we created a slight backwards incompatibility when we worked on am-tools in preparation for the MRP shared task -- we haven't rerun the entire training pipeline on AMR-2017 since.
In general, it would also be useful if you sent us your test.amconll file that the parser created and the (final) predicted AMR graphs.
I'll send the amconll and amr files in an email since they contain LDC data.
Versions used: The results I reported were produced using recent versions of am-parser and am-tools.
Timeout option: I used predict.sh and did not use the -f flag
I just tried cloning am-parser again and using the pretrained model. I get the following results:
Precision: 0.7686 Recall: 0.7122 F-score: 0.7393
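(For reference, the F-score here is just the harmonic mean of the precision and recall, which is what Smatch reports, so the numbers above can be sanity-checked quickly:)

```python
# Sanity check: Smatch's F-score is the harmonic mean of precision and recall.
# Numbers taken from the run above.
p, r = 0.7686, 0.7122
f = 2 * p * r / (p + r)
print(round(f, 4))  # 0.7393
```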
Thanks! I spotted some problems and was able to reproduce the situation. Indeed, by modifying the am-tools code for the shared task, we created some incompatibilities.
Within the tratz branch of am-tools, these issues should be solved. I now get
Precision: 0.7767
Recall: 0.7342
F-score: 0.7549
with our pretrained model on the test set; one sentence was skipped. (In the next few days, we'll also upload a slight correction to the model files that could raise the F-score to about 75.8, by applying a bugfix that we made during the shared task.)
Let me know if the changes in the branch of am-tools don't work as expected.
Next, I'll check that these fixes also work as expected when preparing the training data, so that retraining the model gives a performance in the expected range.
Again and again, I am amazed at what an impact the preprocessing has on the Smatch score.
Just a quick update on this. I reran the entire pipeline (with the tratz branch) and got the following results on the test set:
Precision: 0.7816
Recall: 0.7377
F-score: 0.7590
That includes the bugfix I mentioned above:

> (In the next few days, we'll also upload a slight correction to the model files that could raise the F-score to about 75.8, by applying a bugfix that we made during the shared task.)
If you want, you can try again now, but I'm also planning on making the pipeline easier to run so that you don't have to move so many files around by hand.
Quick update on this: I am now convinced that the changes didn't introduce new issues in the processing of the MRP pipeline, so I merged the branch.
If you download a current version of am-tools, compile it, and replace the automatically downloaded version, predict.sh should give you Smatch scores in the right range.
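If you want to double-check the Smatch score independently of the pipeline's own evaluation, a rough sketch along the following lines should work with the standalone smatch package (pip install smatch). The helper names (get_amr_line, get_amr_match, compute_f) and the placeholder file names are assumptions on my part, so verify them against the smatch version you have installed:

```python
# Rough sketch: corpus-level Smatch between a predicted and a gold AMR file,
# using the standalone "smatch" package rather than am-parser's evaluation.
# Helper names below are assumed from smatch's module-level API; verify them
# against your installed version.
import smatch

def corpus_smatch(pred_path, gold_path):
    total_match = total_test = total_gold = 0
    with open(pred_path) as pred_f, open(gold_path) as gold_f:
        while True:
            pred_amr = smatch.get_amr_line(pred_f)  # one blank-line-separated AMR
            gold_amr = smatch.get_amr_line(gold_f)
            if pred_amr == "" or gold_amr == "":
                break
            match, test, gold = smatch.get_amr_match(pred_amr, gold_amr)
            total_match += match
            total_test += test
            total_gold += gold
            smatch.match_triple_dict.clear()  # reset smatch's per-sentence cache
    # compute_f returns (precision, recall, f_score)
    return smatch.compute_f(total_match, total_test, total_gold)

# Placeholder file names -- point these at your predicted and gold AMR files.
print(corpus_smatch("parser_output.amr", "gold_test.amr"))
```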
The simplification of the pipeline and better documentation, which will be useful if you want to train from scratch, are not complete yet, but I'm on it.
I cleaned up the scripts and added a small example for how to decompose an AMR corpus. The documentation for AMR should be better now. Please have a look at https://github.com/coli-saar/am-parser/wiki/AMR-Preprocessing.
I'm closing the issue for now. Feel free to re-open it if you get unexpected results.
Hello,
I am trying to reproduce the results reported in your ACL 2019 paper for AMR 2017. The documentation isn't the clearest, but I was eventually able to retrain the system.
Here are my current results.
Precision: 0.7542 Recall: 0.7173 F-score: 0.7353
The F-score is lower than the 75.3 reported in the paper. Any ideas of where to start to fix this?
Thanks!
--Stephen