freesunshine0316 / neural-graph-to-seq-mp

Code corresponding to our paper "A Graph-to-Sequence Model for AMR-to-Text Generation"

Unable to recreate paper results #17

Open tagoyal opened 5 years ago

tagoyal commented 5 years ago

Hi, I am trying to train a new model with the LDC2017 data. I took the config parameters from the gold config files. However, I get a very low BLEU score on the test data (around 7 BLEU).

Could you point me to the correct training procedure and parameters you used for training?

Thanks, Tanya

freesunshine0316 commented 5 years ago

Hi, How did you preprocess the data? You need to simplify the AMR side and tokenize the English side with a PTB_tokenizer.

tagoyal commented 5 years ago

Hi, I simplify the AMR using the code you have provided, then create the JSONs. I then move on to the vocabulary extraction part, using the code in the data sub-directory.

When exactly should the tokenization happen?

tagoyal commented 5 years ago

Could the difference be because of the change in dataset? I am running it on the LDC2017T10 dataset. Still, the gap between the accuracy I get and the paper's reported accuracy is pretty huge.

Could you share the dataset used in the paper? Thanks, Tanya Goyal

freesunshine0316 commented 5 years ago

I think the simplifier does tokenization for you, and you could use your own tokenizer too. If you directly use my released model for decoding, then you have to use the released word embeddings as well; otherwise the results will be wrong. I don't think the dataset is the problem, as someone else told me that my model generalizes well to it.

tagoyal commented 5 years ago

I don't use the released model, I use the code to train my own model. This is the config file I use:

```json
{
  "train_path": "data/amr2.0_gold/training.json",
  "finetune_path": "",
  "test_path": "data/amr2.0_gold/dev.json",
  "word_vec_path": "data/vectors_amr2.0.txt.st",
  "suffix": " gold3",
  "model_dir": "logs_g2s",
  "isLower": true,

  "pointer_gen": true,
  "use_coverage": true,
  "attention_vec_size": 300,
  "batch_size": 20,
  "beam_size": 5,

  "num_syntax_match_layer": 9,
  "max_node_num": 200,
  "max_in_neigh_num": 2,
  "max_out_neigh_num": 10,
  "min_answer_len": 0,
  "max_answer_len": 100,
  "learning_rate": 1e-3,
  "lambda_l2": 1e-3,
  "dropout_rate": 0.1,
  "cov_loss_wt": 0.1,
  "max_epochs": 10,
  "optimize_type": "adam",

  "with_highway": true,
  "highway_layer_num": 1,

  "with_char": true,
  "char_dim": 50,
  "char_lstm_dim": 100,
  "max_char_per_word": 20,

  "attention_type": "hidden_embed",
  "way_init_decoder": "all",
  "edgelabel_dim": 50,
  "neighbor_vector_dim": 300,
  "fix_word_vec": true,
  "compress_input": true,
  "compress_input_dim": 300,

  "gen_hidden_size": 300,
  "num_softmax_samples": 100,
  "mode": "ce_train",
  "CE_loss": false,
  "reward_type": "bleu",

  "config_path": "config.json",
  "generate_config": false
}
```

freesunshine0316 commented 5 years ago

Can you share some of your processed data? Also, I use pretrained embeddings from GloVe 840B.

tagoyal commented 5 years ago

These are some samples from the training data: sample.txt

I also use the same pretrained embeddings from GloVe.

freesunshine0316 commented 5 years ago

It's obvious that you didn't tokenize your sentences. You can try using https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer_PTB.perl, and then you have to re-do everything: extracting the vocabulary, extracting the embeddings, and making the JSON files for training, dev and test.
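For a quick sanity check of what PTB-style tokenization produces, here is a rough pure-Python sketch. It is illustrative only — `ptb_tokenize` is a hypothetical helper, not part of this repository, and the Moses `tokenizer_PTB.perl` script linked above is what should be used for the real preprocessing:

```python
import re

# Rough PTB-style tokenization sketch (illustrative only; use the Moses
# tokenizer_PTB.perl script for real preprocessing).
def ptb_tokenize(sentence):
    # Split off clitics such as n't and 's, then separate punctuation.
    sentence = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence)
    sentence = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", sentence)
    sentence = re.sub(r'([.,!?;:()"])', r" \1 ", sentence)
    return sentence.split()

print(" ".join(ptb_tokenize("Don't worry, it's standard.")))
# -> Do n't worry , it 's standard .
```

Note how punctuation and clitics become separate tokens — if your "sent" fields still contain strings like `standard.` or `it's` as single tokens, they were not tokenized.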

freesunshine0316 commented 5 years ago

Hope that can help.

tagoyal commented 5 years ago

Um, do you mean tokenize the "sent" part of the input file?

freesunshine0316 commented 5 years ago

Yes, and tokenization is pretty standard for text generation and other natural language processing tasks.

tagoyal commented 5 years ago

Sure, I'll do that then. Thanks!

tagoyal commented 5 years ago

Can you share the LDC2015E86 dataset so I can compare results against my other models?

freesunshine0316 commented 5 years ago

May I ask for your email address? I assume your institute has the license, as you already have the LDC2017 dataset.

tagoyal commented 5 years ago

It's tanyagoyal@utexas.edu. Thanks much!

tagoyal commented 5 years ago

This is a sample of the data I am using now: sample.txt

I know I could do other post-processing like lowercasing, etc., but this gives around 5 BLEU, which is too low. I have trained the model for around 15 epochs now, with lr = 0.001.

Can you tell me the training procedure you used? Sorry for bothering you about this!

Thanks, Tanya Goyal

freesunshine0316 commented 5 years ago

After tokenization, did you observe a much smaller vocabulary? If not, there's something wrong. Also, you need to compare against the tokenized references rather than the original ones; this is standard too. I reported uncased BLEU with multi-bleu.perl, following previous work. By the way, you can turn off the highway layer, as I never found it to be useful.
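As a rough illustration of what uncased multi-bleu.perl reports, here is a simplified corpus-level BLEU sketch. It is an approximation for sanity checks only — it omits multi-reference support and other details of the actual Perl script:

```python
import math
from collections import Counter

# Simplified uncased corpus-level BLEU, sketching what multi-bleu.perl
# computes: brevity penalty times the geometric mean of modified
# 1..4-gram precisions. An approximation, not a replacement.
def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h = hyp.lower().split()  # uncased, as reported in the paper
        r = ref.lower().split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0 or min(totals) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

Because both sides are lowercased before matching, a casing mismatch between hypothesis and reference does not cost anything under this metric — which is why comparing uncased output against untokenized, cased references understates the score.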

One question: what embeddings did you use? I saw you fix the word embeddings ("fix_word_vec": true). We all use pretrained GloVe embeddings.

freesunshine0316 commented 5 years ago

By the way, it would be better if you could send me the whole dataset you processed so that I can test on my own. We have the license for LDC2017T10; please send it to lsong10@cs.rochester.edu.

tagoyal commented 5 years ago

Hi, I have sent it to your id!

Thanks, Tanya

freesunshine0316 commented 5 years ago

Just found one severe problem: the vocabulary-extracting scripts lowercase everything, so you have to lowercase the sentence part of your data too. You can either do it before making the JSON files, or modify read_amr_file in G2S_data_stream.py a little bit.
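A minimal sketch of the first option — lowercasing the "sent" field before extracting the vocabulary and making the training JSONs. This assumes one JSON object per line; `lowercase_sents` and the file layout are illustrative, and the key name "sent" follows the thread above:

```python
import json

# Hedged sketch: lowercase the "sent" field so the data matches the
# lowercased vocabulary produced by the extraction scripts.
# Assumes one JSON object per line; adjust if your layout differs.
def lowercase_sents(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            inst = json.loads(line)
            inst["sent"] = inst["sent"].lower()
            fout.write(json.dumps(inst) + "\n")
```

After this step, re-run the vocabulary and embedding extraction so the vocabulary, embeddings, and data all agree on casing.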

tagoyal commented 5 years ago

Ok, I'll fix that and see if it improves the results. Why would lowercasing lead to such a huge drop in performance? Is it because the copy mechanism doesn't learn to copy tokens due to the difference in casing?

freesunshine0316 commented 5 years ago

Because of the mismatch between your vocabulary and the data, there will be lots of UNKs that shouldn't be there.


-- best wishes Linfeng Song

freesunshine0316 commented 5 years ago

Hi

You have to use the config file generated after the previous training to avoid this; otherwise you train from scratch again. I'll update the repository shortly. By the way, the training data you shared only has 20K instances; I believe the number should be close to 36K.

On Sat, Apr 13, 2019 at 1:20 PM tagoyal wrote:

> I observed this weird thing where, when I reload the model to train further, it first runs on the dev set to get the accuracy, right? The accuracy value it calculates is almost the one obtained after random initialization, instead of the one that is in the config file. Did you ever observe this?

tagoyal commented 5 years ago

If I remove the "best_accu" entry from the config file generated after the previous training, forcing the model to run again on the dev set, I still get the accuracy that was obtained after random initialization.

So maybe there is some error in the model reloading, because if it actually loaded the right model, the accuracy should be the same as the previously obtained maximum accuracy.

(Yes, using the entire data with 36k examples)

freesunshine0316 commented 5 years ago

That's weird. I'm outside now and will be in touch shortly.
