Amazing-J / structural-transformer

Code corresponding to our paper "Modeling Graph Structure in Transformer for Better AMR-to-Text Generation", EMNLP-IJCNLP 2019

Cannot replicate the result in the paper #7

Open dungtn opened 5 years ago

dungtn commented 5 years ago

Hi @Amazing-J,

I tried to replicate the results of the baseline, feature-based, and CNN/SA models. Here are the results I got (I'm not sure which BLEU score you use, so I include several of them).

For LDC2015E86:

Model     Split  Bleu_1    Bleu_2    Bleu_3    Bleu_4    METEOR
Baseline  Dev    0.495188  0.353920  0.261369  0.196578  0.309095
Baseline  Test   0.504782  0.349076  0.251628  0.184996  0.307734
Features  Dev    0.482722  0.327856  0.228882  0.161975  0.297407
Features  Test   0.488859  0.334304  0.236498  0.170757  0.297764
SA        Dev    0.524293  0.379137  0.282884  0.214684  0.327192
SA        Test   0.531576  0.389128  0.295101  0.228584  0.327300
CNN       Dev    0.495188  0.353920  0.261369  0.196578  0.309095
CNN       Test   0.496904  0.355405  0.262857  0.198174  0.306878

The results are much worse for the LDC2017T10 dataset:

Model     Split  Bleu_1    Bleu_2    Bleu_3    Bleu_4    METEOR
Baseline  Dev    0.204362  0.105597  0.062237  0.038863  0.110788
Baseline  Test   0.197050  0.104745  0.064901  0.042817  0.108762
Features  Dev    0.151187  0.079275  0.047097  0.030196  0.088049
Features  Test   0.156906  0.084370  0.052182  0.034406  0.088722
SA        Dev    0.216386  0.125283  0.079592  0.052227  0.123734
SA        Test   0.203880  0.123374  0.082714  0.058262  0.119108
CNN       Dev    0.234239  0.150478  0.103202  0.073315  0.146379
CNN       Test   0.228974  0.150096  0.103783  0.073553  0.144331

I followed all the steps listed in the README as well as the answers from the repo issues. Am I missing something here?

Also, can you share the vocabulary size and sequence length you used for the LDC2017T10 dataset? According to the code, there are only numbers for the LDC2015E86 dataset.

xdqkid commented 5 years ago

How did you get such low BLEU scores? I reproduced the experiment and got a BLEU of about 26 with the baseline system. Did something go wrong in your experiment, e.g., forgetting tokenization? This is my BLEU score on LDC2017T10: BLEU = 26.17, 58.7/31.9/19.7/12.7 (BP=1.000, ratio=1.006, hyp_len=62719, ref_len=62369)
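For reference, this is roughly how a score in that format is produced with the Moses scripts (a sketch only; the file names are placeholders and the tokenization step depends on your setup):

```bash
# Tokenize reference and hypothesis, then score with multi-bleu.perl (Moses).
perl tokenizer.perl -l en < ref.txt > ref.tok.txt
perl tokenizer.perl -l en < hyp.txt > hyp.tok.txt
# Prints a line like: BLEU = 26.17, 58.7/31.9/19.7/12.7 (BP=1.000, ...)
perl multi-bleu.perl ref.tok.txt < hyp.tok.txt
```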

Amazing-J commented 5 years ago

We use the BLEU-4 score by default, computed with the multi-bleu tool. First, make sure you are performing the BPE operation correctly:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}

{num_operations}: 20000 for LDC2017T10; 10000 for LDC2015E86.
{train_file}: train_src + train_tgt
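As a concrete illustration for LDC2017T10 (a sketch only; the file names below are placeholders, not the actual corpus files):

```bash
# Learn a joint BPE model on the concatenated source and target training data,
# then apply it to every split. 20000 merge operations for LDC2017T10.
cat train.src train.tgt > train.all
subword-nmt learn-bpe -s 20000 < train.all > bpe.codes
for split in train dev test; do
    subword-nmt apply-bpe -c bpe.codes < ${split}.src > ${split}.bpe.src
    subword-nmt apply-bpe -c bpe.codes < ${split}.tgt > ${split}.bpe.tgt
done
```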

Second, are you sure there is no problem with your corpus? I am really surprised that your baseline scores are so low.

dungtn commented 5 years ago

Yes, I used the BPE operation as instructed. These are the results of the hypotheses with BPE removed.
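For reference, a sketch of the usual way subword-nmt's @@ joiners are stripped before scoring (placeholder file names):

```bash
# Undo BPE by removing the "@@ " joiners from the decoded output.
sed -r 's/(@@ )|(@@ ?$)//g' < hyp.bpe.txt > hyp.txt
```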

Here are the first 5 lines from LDC2017T10 after tokenization:

date-entity :year 2002 :month 1 :day 5 country :wiki saudi_arabia :name ( name :op1 saudi :op2 arabia ) and :op1 international :op2 military :op3 terrorism say :arg0 ( university :wiki - :name ( name :op1 na@@ if :op2 arab :op3 academy :op4 for :op5 security :op6 sciences ) :arg1-of ( base :location ( city :wiki riyadh :name ( name :op1 riyadh ) ) ) ) :arg1 ( run :arg0 university :arg1 ( workshop :beneficiary ( person :quant 50 :arg1-of ( expert :arg2 ( counter :arg1 terrorism ) ) ) :duration ( temporal-quantity :quant 2 :unit week ) ) ) :medium statement re@@ open :arg1 ( university :wiki - :name ( name :op1 na@@ if :op2 arab :op3 academy :op4 for :op5 security :op6 sciences ) :purpose ( oppose :arg1 terror ) :mod ( ethnic-group :wiki arabs :name ( name :op1 arab ) :mod pan ) ) :time ( date-entity :year 2002 :month 1 :day 5 ) :frequency ( first :time ( since :op1 ( attack :arg1 ( country :wiki united_states :name ( name :op1 us ) ) :time ( date-entity :year 2001 :month 9 ) ) ) )

I'm pretty sure that there's no problem with the corpus. I double-checked it.

In preprocess.sh, there are a few options:

-src_vocab_size 30000 \
-tgt_vocab_size 30000 \
-src_seq_length 10000 \
-tgt_seq_length 10000

Are these the correct values for LDC2017T10?

Amazing-J commented 5 years ago

The source and the target share the vocabulary, so after BPE the vocabulary size should be close to the number of BPE merge operations; the exact value does not have much impact. I assume that you are authorized by LDC. Please send me your email and I can provide you with my baseline corpus.
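As an illustration only (the flags and values below are an assumption based on this reasoning and on the OpenNMT-py-style options shown above in preprocess.sh, not a configuration confirmed for the paper), the LDC2017T10 preprocessing options might look like:

```bash
# Assumed settings: shared vocabulary, sized near the 20000 BPE merge operations.
# File names are placeholders.
python preprocess.py \
    -train_src train.bpe.src -train_tgt train.bpe.tgt \
    -valid_src dev.bpe.src -valid_tgt dev.bpe.tgt \
    -save_data data/ldc2017t10 \
    -share_vocab \
    -src_vocab_size 20000 \
    -tgt_vocab_size 20000 \
    -src_seq_length 10000 \
    -tgt_seq_length 10000
```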

dungtn commented 5 years ago

Yes, I am. I've sent an email to the first author of the paper (assuming that's you :-)). Thank you for helping me out.

vivald99 commented 5 years ago

Hi @Amazing-J , (cc @dungtn )

Thank you for sharing your work. I'm also having problems replicating the results from the paper. Could you please explain in detail how to preprocess the LDC AMR files, or share the code to do that? Even a small mistake in the preprocessing step can lead to worse results. The README does not make it clear how to generate the files needed to replicate the results, especially for the SA and CNN models.

I used BPE (as you explained) and a shared vocabulary, and got 23.13 BLEU, 30.64 METEOR and 56.84 chrF++ on the LDC2015E86 dev set. In the paper you report 24.93 BLEU, 33.20 METEOR and 60.30 chrF++.

@dungtn If you have the preprocessing code (for the SA and CNN models) as a bash or Python script, it would be great if you could share it for replication. Could you share the files?

dungtn commented 5 years ago

@vivald99 Sure, I'm happy to share that (disclaimer: it's based on my understanding, I'm not 100% sure it's what the author did). You can find the code in my fork of this repo :-)

vivald99 commented 5 years ago

@dungtn Thank you for your reply! I will run the model again with your preprocessing. If you reproduce results similar to those in the paper, please share the information.

dungtn commented 5 years ago

I checked the LDC2017T10 data; there are two differences:

  1. The order in which the AMRs were processed (there are multiple files in the amrs directory, and I processed them in a different order than you did).
  2. There are a few places where the BPE is different, e.g., publication :arg1 ( lapse :arg1 ( memory... (yours) vs. publication :arg1 ( lap@@ se :arg1 ( memory... (mine).

@Amazing-J Can you share the command you used to linearize/anonymize the AMRs?

Is it ./anonDeAnon_java.sh anonymizeAmrFull true <amr_filename> from sinantie/NeuralAmr?
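If it is, a sketch of how that command might be run over the three splits (file names are placeholders):

```bash
# Hypothetical invocation of the sinantie/NeuralAmr anonymization script per split.
for split in train dev test; do
    ./anonDeAnon_java.sh anonymizeAmrFull true amr_${split}.txt
done
```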