tagoyal opened this issue 5 years ago
Hi, how did you preprocess the data?

You need to simplify the AMR side and tokenize the English side with a PTB tokenizer.
Hi, I simplified the AMR using the code you provided, then created the JSONs. I then moved on to the vocabulary extraction part, using the code in the data sub-directory.
When exactly should the tokenization happen?
Could the difference be because of the change in dataset? I am running it on the LDC2017T10 dataset. Still, the gap between the accuracy I get and the accuracy reported in the paper is pretty huge.
Could you share the dataset used in the paper? Thanks, Tanya Goyal
I think the simplifier does tokenization for you, and you could use your own tokenizer too. If you directly use my released model for decoding, then you have to use the released word embeddings as well; otherwise it's wrong. I don't think the dataset is the problem, as someone else told me that my model generalizes well on it.
I don't use the released model, I use the code to train my own model. This is the config file I use:
{ "train_path": "data/amr2.0_gold/training.json", "finetune_path": "", "test_path": "data/amr2.0_gold/dev.json", "word_vec_path": "data/vectors_amr2.0.txt.st", "suffix": " gold3", "model_dir": "logs_g2s", "isLower": true,
"pointer_gen": true,
"use_coverage": true,
"attention_vec_size": 300,
"batch_size": 20,
"beam_size": 5,
"num_syntax_match_layer": 9,
"max_node_num": 200,
"max_in_neigh_num": 2,
"max_out_neigh_num": 10,
"min_answer_len": 0,
"max_answer_len": 100,
"learning_rate": 1e-3,
"lambda_l2": 1e-3,
"dropout_rate": 0.1,
"cov_loss_wt": 0.1,
"max_epochs": 10,
"optimize_type": "adam",
"with_highway": true,
"highway_layer_num": 1,
"with_char": true,
"char_dim": 50,
"char_lstm_dim": 100,
"max_char_per_word": 20,
"attention_type": "hidden_embed",
"way_init_decoder": "all",
"edgelabel_dim": 50,
"neighbor_vector_dim": 300,
"fix_word_vec": true,
"compress_input": true,
"compress_input_dim": 300,
"gen_hidden_size": 300,
"num_softmax_samples": 100,
"mode": "ce_train",
"CE_loss":false,
"reward_type":"bleu",
"config_path": "config.json",
"generate_config": false
}
```
Can you share some of your processed data? Also, I use pretrained embeddings from GloVe 840B.
These are some samples from the training data: sample.txt
I also use the same pre-trained GloVe embeddings.
It's obvious that you didn't tokenize your sentences. You can try using https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer_PTB.perl, and then you have to redo everything, such as extracting the vocabulary, extracting the embeddings, and making the JSON files for training, dev and test ...
Hope that can help.
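Roughly, what I mean is something like this (just a sketch, not our actual preprocessing script; the *.sent.* file names are placeholders):

```python
# Not the repo's own script -- a sketch of re-tokenizing the sentence side
# with the Moses PTB tokenizer before rebuilding vocab/embeddings/JSONs.
import subprocess

TOKENIZER = "mosesdecoder/scripts/tokenizer/tokenizer_PTB.perl"

def ptb_tokenize_file(in_path, out_path):
    # The Moses tokenizers read one sentence per line on stdin and write to stdout.
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        subprocess.run(["perl", TOKENIZER], stdin=fin, stdout=fout, check=True)

for split in ("training", "dev", "test"):
    ptb_tokenize_file(split + ".sent.raw", split + ".sent.tok")

# After this: re-extract the vocabulary, re-extract the embeddings, and rebuild
# the JSON files for training/dev/test from the tokenized sentences.
```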
Um, do you mean tokenize the "sent" part of the input file?
Yes, and tokenization is pretty standard for text generation and other natural language processing tasks.
Sure, I'll do that then. Thanks!
Can you share the LDC2015E86 dataset so I can compare results against my other models?
May I ask for your email address? I assume your institute has the license, as you already have the LDC2017T10 dataset.
It's tanyagoyal@utexas.edu Thanks much!
This is a sample of the data I am using now: sample.txt
I know I could do other post-processing like lowercasing etc., but this gives around 5 BLEU, which is too low. I have trained the model for around 15 epochs now, with lr = 0.001.
Can you tell me the training procedure you used? Sorry for bothering you about this!
Thanks, Tanya Goyal
After tokenization, did you observe a much smaller vocabulary? If not, there's something wrong. Also, you need to compare with the tokenized references rather than the original ones; this is standard too. I reported uncased BLEU with multi-bleu.perl, following previous work. By the way, you can turn off the highway layer, as I never found it to be useful.
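For scoring, something along these lines (just an illustration; the file paths are placeholders, and multi-bleu.perl is the Moses script):

```python
# A sketch of the scoring setup: compare system output against the *tokenized*
# references and report uncased BLEU with Moses multi-bleu.perl.
import subprocess

MULTI_BLEU = "mosesdecoder/scripts/generic/multi-bleu.perl"

with open("test.output.tok") as hyp:
    # -lc lowercases both sides, i.e. uncased BLEU.
    subprocess.run(["perl", MULTI_BLEU, "-lc", "test.reference.tok"],
                   stdin=hyp, check=True)
```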
One question: what embeddings did you use? I saw that you fixed the word embeddings ("fix_word_vec": true).

We all use pretrained GloVe embeddings.
By the way, it would be better if you could send me the whole dataset you processed so that I can test it on my own. We have the license for LDC2017T10; please send it to lsong10@cs.rochester.edu.
Hi, I have sent it to your email address!
Thanks, Tanya
Just found one severe problem: the vocabulary-extraction scripts lowercase everything, so you have to make the sentence part of your data lowercase too. You can either do it before making the JSON files or modify "read_amr_file" in G2S_data_stream.py a little bit.
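For the first option, a rough sketch, assuming each data file is a JSON list of instances with a "sent" field (adjust it if your layout differs; the output file names are placeholders):

```python
# Lowercase the sentence side before vocabulary extraction / training.
import json

def lowercase_sents(in_path, out_path):
    with open(in_path) as f:
        data = json.load(f)
    for inst in data:
        inst["sent"] = inst["sent"].lower()
    with open(out_path, "w") as f:
        json.dump(data, f)

for split in ("training", "dev"):
    lowercase_sents("data/amr2.0_gold/%s.json" % split,
                    "data/amr2.0_gold/%s.lower.json" % split)
```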
OK, I'll fix that and see if it improves the results. Why would lowercasing lead to such a huge drop in performance? Is it because the copy mechanism doesn't learn to copy tokens due to the difference in casing?
Because of the mismatch between your vocabulary and the data, there will be lots of UNKs that shouldn't be there.
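A quick way to check, assuming the vocabulary/embedding file has one entry per line with the word first (the tokenized-sentence path is a placeholder):

```python
# Diagnostic: what fraction of sentence tokens fall outside the vocabulary?
def load_vocab(path):
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

def unk_rate(sent_path, vocab):
    total, unk = 0, 0
    with open(sent_path) as f:
        for line in f:
            for tok in line.lower().split():
                total += 1
                unk += tok not in vocab
    return unk / max(total, 1)

vocab = load_vocab("data/vectors_amr2.0.txt.st")
print("UNK rate: %.2f%%" % (100.0 * unk_rate("training.sent.tok", vocab)))
```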
Hi
You have to use the config file generated after the previous training to avoid this; otherwise you train from scratch again. I'll update the repository shortly. By the way, the training data you shared only has 20K instances; I believe the number should be close to 36K.
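Roughly, the idea is something like this (the saved-config file name below is just a placeholder for whatever your previous run wrote into model_dir):

```python
# Illustration only -- resume from the config written by the previous run,
# not from the original config.json.
import json
import os

model_dir = "logs_g2s"
resume_config_path = os.path.join(model_dir, "saved_config.json")  # placeholder name

with open(resume_config_path) as f:
    config = json.load(f)

# Without a recorded "best_accu", the next run behaves like training from
# scratch (re-evaluating the dev set from freshly initialized weights).
if "best_accu" not in config:
    print("Warning: no best_accu recorded in", resume_config_path)

# Point the trainer at resume_config_path instead of the original config.json.
```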
On Sat, Apr 13, 2019, tagoyal wrote:
> I observed this weird thing: when I reload the model to train further, it first runs on the dev set to get the accuracy, right? The accuracy value it calculates is almost the one obtained after random initialization, instead of the one in the config file. Did you ever observe this?
If I remove the "best_accu" entry from the config file generated after the previous training, forcing the model to run on the dev set again, I still get roughly the accuracy that was obtained right after random initialization.
So maybe there is some error in the model reloading, because if it actually loaded the right model, the accuracy should be the same as the previously recorded best accuracy.
(Yes, I am using the entire dataset with 36K examples.)
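For reference, this is the kind of sanity check I have in mind, assuming TensorFlow 1.x-style checkpoints (which I believe this model uses):

```python
# If latest_checkpoint() returns None, or saver.restore() is never reached
# before the dev evaluation, the evaluation runs on randomly initialized weights.
import tensorflow as tf

model_dir = "logs_g2s"
ckpt = tf.train.latest_checkpoint(model_dir)
print("Latest checkpoint under %s: %s" % (model_dir, ckpt))

# ... build the graph exactly as in training, then:
# saver = tf.train.Saver()
# with tf.Session() as sess:
#     sess.run(tf.global_variables_initializer())
#     if ckpt is not None:
#         saver.restore(sess, ckpt)
#     # only now run the dev-set evaluation
```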
That's weird. I'm outside now and will be in touch shortly.
Hi, I am trying to train a new model with the LDC2017T10 data. I took the config parameters from the gold config files. However, I get a very low BLEU score on the test data (around 7 BLEU).
Could you point me to the correct training procedure and parameters you used for training?
Thanks, Tanya