Hey, so we are getting close to these results, but still a little bit below.
This document describes how to replicate summarization experiments on the CNN/DailyMail (CNNDM) and Gigaword datasets using OpenNMT-py. In the following, we assume access to a tokenized form of the corpus split into train/valid/test sets.
An example article-title pair from Gigaword should look like this:
Input: australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .
Output: australian current account deficit narrows sharply
Since we are using copy attention [1] in the model, we need to preprocess the dataset such that source and target are aligned and use the same dictionary. This is achieved by using the options dynamic_dict and share_vocab.
We additionally turn off truncation of the source to ensure that inputs longer than 50 words are not truncated.
For CNNDM we follow See et al. [2] and additionally truncate the source length at 400 tokens and the target at 100.
command used:
(1) CNNDM
python preprocess.py -train_src data/cnndm/train.txt.src -train_tgt data/cnn-no-sent-tag/train.txt.tgt -valid_src data/cnndm/val.txt.src -valid_tgt data/cnn-no-sent-tag/val.txt.tgt -save_data data/cnn-no-sent-tag/cnndm -src_seq_length 10000 -tgt_seq_length 10000 -src_seq_length_trunc 400 -tgt_seq_length_trunc 100 -dynamic_dict -share_vocab
(2) Gigaword
python preprocess.py -train_src data/giga/train.article.txt -train_tgt data/giga/train.title.txt -valid_src data/giga/valid.article.txt -valid_tgt data/giga/valid.title.txt -save_data data/giga/giga -src_seq_length 10000 -dynamic_dict -share_vocab
The training procedure described in this section largely follows the parameter choices and implementation of See et al. [2]. As mentioned above, we use copy attention as a mechanism for the model to decide whether to generate a new word or to copy it from the source (copy_attn).
A notable difference to See's model is that we are using the attention mechanism introduced by Bahdanau et al. [3] (global_attention mlp) instead of that by Luong et al. [4] (global_attention dot). Both options typically perform very similarly, with Luong attention often having a slight advantage.
We are using a 128-dimensional word embedding and a 512-dimensional one-layer LSTM. On the encoder side, we use a bidirectional LSTM (brnn), which means that the 512 dimensions are split into 256 dimensions per direction.
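As a standalone PyTorch illustration of that split (not OpenNMT-py's actual encoder class, just a sketch of the dimensions involved):

import torch.nn as nn

# A "512-dimensional" bidirectional encoder is realized as two 256-dimensional
# directions; their outputs are concatenated back to 512 dimensions.
encoder = nn.LSTM(input_size=128, hidden_size=256, num_layers=1,
                  bidirectional=True, batch_first=True)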
We also share the word embeddings between encoder and decoder (share_embeddings). This option drastically reduces the number of parameters the model has to learn. However, we found that omitting it has only a minimal impact on performance.
For the training procedure, we are using SGD with an initial learning rate of 1 for a total of 16 epochs. In most cases, the lowest validation perplexity is achieved around epoch 10-12. We also use OpenNMT's default learning rate decay, which halves the learning rate after every epoch once the validation perplexity has increased after an epoch (or after epoch 8 at the latest). Alternative training procedures such as Adam with an initial learning rate of 0.001 converge faster than SGD, but achieve slightly worse results. We additionally set the maximum norm of the gradient to 2, and renormalize if the gradient norm exceeds this value.
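A minimal sketch of this decay rule (illustrative only, not OpenNMT-py's actual optimizer code; the function and argument names are made up for this example):

# Start halving the learning rate once validation perplexity goes up,
# or after epoch 8, and keep halving every epoch from then on.
def update_lr(lr, epoch, val_ppl, prev_val_ppl, start_decay,
              start_decay_at=8, decay=0.5):
    if epoch > start_decay_at or (prev_val_ppl is not None and val_ppl > prev_val_ppl):
        start_decay = True
    if start_decay:
        lr *= decay
    return lr, start_decay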
commands used:
(1) CNNDM
python train.py -save_model logs/notag_sgd3 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 256 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9
(2) Gigaword
python train.py -save_model logs/giga_sgd3_512 -data data/giga/giga -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9
During inference, we use beam-search with a beam-size of 10.
We additionally use the replace_unk option, which replaces generated <UNK> tokens with the source token that received the highest attention. This acts as a safety net should the copy attention, which should learn to copy such words, fail.
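Conceptually, replace_unk does something like the following (a hypothetical standalone sketch; the real logic lives in OpenNMT-py's translator):

def replace_unks(pred_tokens, src_tokens, attn):
    # attn[i][j]: attention weight of target position i over source position j
    out = []
    for i, tok in enumerate(pred_tokens):
        if tok.lower() == "<unk>":
            j = max(range(len(src_tokens)), key=lambda k: attn[i][k])
            tok = src_tokens[j]
        out.append(tok)
    return out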
commands used:
(1) CNNDM
python translate.py -gpu 2 -batch_size 1 -model logs/notag_try3_acc_49.29_ppl_14.62_e16.pt -src data/cnndm/test.txt.src -output sgd3_out.txt -beam_size 10 -replace_unk
(2) Gigaword
python translate.py -gpu 2 -batch_size 1 -model logs/giga_sgd3_512_acc_51.10_ppl_12.04_e16.pt -src data/giga/test.article.txt -output giga_sgd3.out.txt -beam_size 10 -replace_unk
To evaluate the ROUGE scores on CNNDM, we extended the pyrouge wrapper with additional evaluations such as the number of repeated n-grams (typically found in models with copy attention), found here.
It can be run with the following command:
python baseline.py -s sgd3_out.txt -t ~/datasets/cnn-dailymail/sent-tagged/test.txt.tgt -m no_sent_tag -r
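The repeated-n-gram evaluation mentioned above boils down to something like the following (a rough sketch with hypothetical helper names, not the actual baseline.py code):

from collections import Counter

def repeated_ngram_rate(tokens, n=3):
    # Fraction of n-grams in a summary that are repeats of an earlier n-gram.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return float(repeated) / len(ngrams)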
Note that the no_sent_tag option strips tags around sentences: a sentence that previously was <s> w w w w . </s> becomes w w w w .
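For illustration, that stripping amounts to something like this (a hypothetical helper, not the actual baseline.py code):

def strip_sent_tags(line):
    # Drop the <s> and </s> tokens, keep everything else.
    return " ".join(tok for tok in line.split() if tok not in ("<s>", "</s>"))

# strip_sent_tags("<s> w w w w . </s>")  ->  "w w w w ."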
For evaluation of large test sets such as Gigaword, we use a parallel Python wrapper around ROUGE, found here.
command used:
files2rouge giga_sgd3.out.txt test.title.txt --verbose
Running the commands above should yield the following scores:
ROUGE-1 (F): 0.352127
ROUGE-2 (F): 0.173109
ROUGE-3 (F): 0.098244
ROUGE-L (F): 0.327742
ROUGE-S4 (F): 0.155524
[1] Vinyals, O., Fortunato, M. and Jaitly, N., 2015. Pointer Networks. NIPS
[2] See, A., Liu, P.J. and Manning, C.D., 2017. Get To The Point: Summarization with Pointer-Generator Networks. ACL
[3] Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. ICLR
[4] Luong, M.T., Pham, H. and Manning, C.D., 2015. Effective approaches to attention-based neural machine translation. EMNLP
Cool. Can you let us know what results you got? When you say "better", do you mean compared to what?
Hey, you wrote:
python baseline.py ...
Can't seem to find this file. Can you link me to the project?
By "better" I meant comparing the accuracy results on the original dataset to runs on See's preprocessed data.
The script we've been using is this one: https://github.com/falcondai/pyrouge/ This is a slightly modified version of the script described here: http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85
Thanks for the note about See's dataset. I will try and compare models with the different datasets
Still not sure where this baseline.py file is.
I can run the script as in https://github.com/falcondai/pyrouge/, but I believe using baseline.py with its no_sent_tag option would be smarter.
Interesting discussion.
@srush your example shows the -brnn flag, which is now deprecated. You may want to replace it with -encoder_type brnn.
@mataney I linked the wrong repo - https://github.com/falcondai/rouge-baselines is what we use (that in turn uses pyrouge) One question, how do you use the preprocessed data you linked above? From my understanding, the download link has the individual documents instead of one large file. Do you just concatenate them? If so, do you have a script that I can use to reproduce your findings?
@pltrdy You're absolutely right, I copied the commands from a time before the brnn switch. We should definitely change that.
@sebastianGehrmann Cool, will run rouge-baselines on my model soon.
And in order to get just the big files I ran some of See's code (because I wanted to extract something beyond just the article and the abstract). So the following code is just the gist of See's preprocessing.
https://gist.github.com/mataney/67cfb05b0b84e88da3e0fe04fb80cfc8
So you can do something like this, or you can just concatenate them (the latter will be shorter)
Thanks, I'll check it out. To make sure we use the same exact files, could you upload yours and send me a download link via email? That'd be great! (gehrmann (at) seas.harvard.edu)
Huh, this is the code I ran to make the dataset, it was forked from hers. https://github.com/OpenNMT/cnn-dailymail
I wonder if she changed anything...
Oh I see, this is after the files are created. Huh, so the only thing I see that could be different is that she drops blank lines and does some unicode encoding. @mataney Could you run "sdiff
@srush These files should be the same (sdiff shouldn't work, as I have more data about each article than just article and abstract; I deleted this from my gist).
I can conclude with false alarm as I didn't know you are using See's preprocessing, but you do :) So our tokenization etc are the same.
Another question: after training and translating I only get one-sentence summaries. This seems strange.
@srush are the translations you passed to baseline.py one-sentence summaries as well?
Oh, shoot. I forgot to mention this. See uses </s> as her sentence end token, which is unfortunately what we use in translate as well :( For our experiments we replaced hers with </t>. You can either do that, or change the end condition in translate to two repeated </s> tokens.
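For example, a minimal retagging pass over the target files could look like this (file names are placeholders; apply the same thing to each split):

# Swap See's sentence-end token </s> for </t> so it no longer clashes with
# the decoder's end-of-sequence token.
with open("train.txt.tgt") as fin, open("train.txt.tgt.retagged", "w") as fout:
    for line in fin:
        fout.write(line.replace("</s>", "</t>"))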
Why not just replace </s> with . ?
BTW, it seems that there is no -m no_sent_tag option in the falcondai repo. I guess you are using a modified version?!
Hey guys, any feature ideas/fixes you can think of that would get us closer to See's results (seq2seq + attn + pointer, then coverage)?
I think we are basically there. What scores are you getting?
@sebastianGehrmann (when he gets back from vacation)
Using the HP you, @srush, mentioned above, I get the following ROUGE scores on CNN/DM (after 16 epochs):
ROUGE-1 (F): 0.323996
ROUGE-2 (F): 0.140015
ROUGE-L (F): 0.244148
ROUGE-3 (F): 0.081449
ROUGE-S4 (F): 0.105728
Getting about the same, although I'm getting better results when embedding and hidden sizes are 500. This is still rather different from what See reports - ROUGE-1, 2, L of 39.53, 17.28, 36.38 respectively.
(Obviously this is said without taking anything away from the brilliant work that has been done here! :smile: )
Okay, let me post our model, we're doing a lot better. Think we need to update the docs.
(Although, worrisome that you are getting different results with the same args. I will check into that. )
Okay, here are his args:
python train.py -save_model /scratch/cnndm/ada4 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -encoder_type brnn -epochs 16 -seed 777 -batch_size 16 -max_grad_norm 2 -share_embeddings -dropout 0. -gpuid 3 -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1
(See's RNN is split 512/256 which we don't support at the moment.)
And then during translation use Wu style coverage with -alpha 0.9 -beta 0.25
We're seeing train ppl of 12.84, val ppl of 11.98 and ROUGE-1/2 of 0.38 | 0.168
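For reference, the -alpha/-beta options above correspond to the Wu et al. (GNMT) length and coverage penalties. A rough sketch of how such a penalty enters the final hypothesis score (an illustration under those assumptions, not OpenNMT-py's exact code):

import math

def wu_rescore(logprob, hyp_len, attn_sums, alpha=0.9, beta=0.25):
    # attn_sums[j]: total attention mass the hypothesis has put on source position j
    length_penalty = ((5.0 + hyp_len) / 6.0) ** alpha
    coverage_penalty = beta * sum(math.log(min(max(a, 1e-10), 1.0)) for a in attn_sums)
    return logprob / length_penalty + coverage_penalty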
Hey :) Tried to run this and it appears to be stuck around 4% accuracy. Just pulled from master, didn't change a thing.
So the only thing that might be different is the data that is being passed to preprocess.py.
Something special about it?
M.
Okay, let's setup a conference call to figure this out. Sebastian is away at the moment, but he will be back shortly.
Hey @mataney I just tried to run with the latest code and within 150 batches I am getting over 8% accuracy. (note: don't preprocess with sharding since that breaks copy attention at the moment). Can you give me some more information about your pytorch version, torchtext version, your preprocessing command etc? Also, can you send me a download link to your dataset by email (my email is my last name @seas.harvard.edu) so I can see if there is something wrong with yours.
With the last set of HPs @srush provided (rnn_size 512 etc.) I get (after 16 epochs):
ROUGE-1 (F): 0.346241
ROUGE-2 (F): 0.143481
ROUGE-3 (F): 0.082514
ROUGE-L (F): 0.242341
I uploaded experiment details and the model: https://github.com/pltrdy/shared_models/blob/master/20180117_onmtpy_sum.md
Hey @pltrdy this seems closer, but not quite there yet. Without alpha and beta, the result should be around
R-1 36.2
R-2 16.2
R-L 33.4
and with your exact command, we got
R-1 38.0
R-2 16.8
R-L 35.0
I trained a model, decoded and got worse scores similar to yours. I am currently investigating why that is the case. Will update here later.
However, I noticed one thing - you are training the model with
It would be interesting to run each other's models to see where the problem is.
Would you upload yours? (producing 36.2 / 16.2 / 33.4 without alpha/beta)
My model is running on 1990b3b30c1c1c6d2bc3f31b08bf789bbd14886d and different torchtext version, so it makes little sense to compare across so many commits. @srush thinks the refactored preprocessing code might break copy mechanism, so I am investigating that now. You can see the comparison here: https://github.com/OpenNMT/OpenNMT-py/compare/1990b3b30c1c1c6d2bc3f31b08bf789bbd14886d...master
I uploaded the generated text here, if you want to have a look: https://drive.google.com/file/d/11QXjAnY40j14tkG8Hz4q5Yg-IKH8_XSL/view?usp=sharing
@sebastianGehrmann from the archive you uploaded, I ran ROUGE scoring on ada4_out.txt and got:
ROUGE-1 (F): 0.359900
ROUGE-2 (F): 0.161221
ROUGE-3 (F): 0.095439
ROUGE-L (F): 0.263297
R-1 and R-2 are OK, but ROUGE-L is 10 points below. I got those ROUGE scores using the files2rouge utility with default options, and files2rouge with (slightly different) args from the falcondai pyrouge repo. I score ada4_out.txt against test.tgt.noeosbos.txt, which is your test.txt.tgt without any <s> or </s>.
I really don't get what could be wrong...
Can you try using another pyrouge wrapper? I found that with your files2rouge, something is funky with rouge-L. (The 0.2 point difference in rouge-1 is I think because I decoded from a different epoch than the best I got, so nothing wrong there)
Hi, do you follow See et al.'s approach of applying the coverage objective only after n epochs are completed?
@ratishsp nope. In OpenNMT-py, coverage is jointly trained from the first epoch.
Ok. You think your scores may improve further because of that?
Also, unlike what @srush mentioned in the section on Training in the Summarization Experiment Documentation, I believe See et al. also use the mlp attention of Bahdanau et al. Please confirm.
@ratishsp probably, but I don't know how much this could improve. My personal guess is that i) it makes training faster, ii) it is marginally beneficial, which makes it an interesting - but not mandatory - trick.
The second comment is quite out of this issue's scope, but I do agree.
@sebastianGehrmann I figured out my problem. In fact, my ROUGE files were on a single line instead of one sentence per line; that's why the ROUGE-L was incorrect.
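For anyone hitting the same issue, a hypothetical helper that writes one sentence per line before scoring, assuming sentences are delimited by the </t> tag discussed above, could look like:

def one_sentence_per_line(summary):
    # Split a decoded summary on the </t> sentence delimiter and drop empties.
    sents = [s.replace("<t>", "").strip() for s in summary.split("</t>")]
    return "\n".join(s for s in sents if s)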
Hey, after talking with @sebastianGehrmann and realising I had a problem with my dataset, I trained a new model with the HP mentioned above. Still not getting the 38 ROUGE-1 results you are getting.
Translating with -beam_size 10 -replace_unk -alpha 0.9 -beta 0.25 after training for 16 epochs.
Using @pltrdy 's files2rouge, results after removing all bos and eos symbols:
1 ROUGE-1 Average_R: 0.35814 (95%-conf.int. 0.35570 - 0.36073)
1 ROUGE-1 Average_P: 0.35742 (95%-conf.int. 0.35475 - 0.36027)
1 ROUGE-1 Average_F: 0.34516 (95%-conf.int. 0.34292 - 0.34743)
---------------------------------------------
1 ROUGE-2 Average_R: 0.15130 (95%-conf.int. 0.14913 - 0.15364)
1 ROUGE-2 Average_P: 0.15203 (95%-conf.int. 0.14975 - 0.15461)
1 ROUGE-2 Average_F: 0.14613 (95%-conf.int. 0.14403 - 0.14834)
---------------------------------------------
1 ROUGE-L Average_R: 0.32948 (95%-conf.int. 0.32707 - 0.33196)
1 ROUGE-L Average_P: 0.32939 (95%-conf.int. 0.32670 - 0.33217)
1 ROUGE-L Average_F: 0.31780 (95%-conf.int. 0.31560 - 0.32008)
Same dataset, same code, same HP.
Do you have any advice?
@mataney my files2rouge was wrong. I just updated it (like 5 mins ago) so it correctly calculates ROUGE-L for multi-sentence sequences. It is basically a wrapper around pyrouge; my previous idea of parallel scoring is just bad (less accurate and not much faster).
@ratishsp We are using -global_attention mlp in our models, but no coverage attention at all. Instead, we use the Wu et al. coverage penalty during decoding.
It seems that somewhere in the recent changes the copy attention broke, leading to decreased performance. I think it might have been partially fixed in e72c92bb49b6e3aea42f0b06962a0c814adbff95, but according to the numbers by @mataney there is still a drop of ~2 points across the board.
@pltrdy thanks, just pulled, ROUGE-L indeed looks better, edited original comment. @sebastianGehrmann Thanks, I think I will try later to run from the same commit as you ran, just to be sure this is indeed the case. Will try to understand later what might have been broken.
@sebastianGehrmann Ok, I understand. A couple of queries then: did coverage attention not work well? The Wu et al. coverage penalty (from GNMT, I believe) may work well for translation, but should it work well for summarization, given that attention may not need to be applied uniformly to the source in summarization?
@mataney I am double-checking myself right now as well with different torchtext/pytorch/onmt versions.
@ratishsp We have not successfully managed to make coverage attention work yet, but it's an ongoing investigation. There are unfortunately several tricks needed which make the results heavily dependent on hyperparameters, random initialization etc. Training with coverage for the whole duration does not work. The coverage penalty definitely increases the results (see above for numbers), but not quite as much as the reported numbers with coverage attention by See et al.
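For reference, See et al.'s coverage attention adds a per-step loss term like the following (a sketch for context only; as noted above, this is not what produced the reported numbers, which use the decoding-time penalty instead):

def coverage_loss(attn, coverage):
    # attn: attention distribution over source positions at one decoder step.
    # coverage: sum of attention distributions from all previous steps.
    return sum(min(a, c) for a, c in zip(attn, coverage))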
Thank you all for this work. I was wondering if by any chance you have the baseline (vanilla seq-2-seq with mlp attention) model for abstractive summarization available? Please let me know.
Update: Is the model mentioned here https://github.com/Iwontbecreative/OpenNMT-py an official model?
Small update regarding my earlier comment ("I think I will try later to run from the same commit as you ran, just to be sure this is indeed the case. Will try to understand later what might have been broken."):
I reran it and managed to get 38 ROUGE-1, but I didn't manage to figure out what might have been broken.
@mataney could you share the experimental setup? (hparams, commit hash, the model itself could be interesting as well to check if translation was the problem etc).
thanks
The commit I'm using is the December 5th commit: 132eb635029137b02a2d8fe1c39a9dd9d4fc37cf
I ran the preprocess script using dynamic_dict and share_vocab.
My train run is like the one suggested above:
python train.py -gpuid ??? -data ??? -copy_attn -global_attention mlp -word_vec_size 128
-rnn_size 512 -layers 1 -encoder_type brnn -epochs 20 -seed 777 -batch_size 16
-max_grad_norm 2 -share_embeddings -dropout 0. -optim adagrad -learning_rate 0.15
-adagrad_accumulator_init 0.1
Although I'm running this for 20 epochs, I got the 38 ROUGE-1 F1 result after around 11 epochs; I have yet to check results for later epochs.
My versions: pytorch=0.2.0_2, torchtext 0.1.1.
Tried to compare the diff between this commit and master. Yet to find anything interesting.
Dear all, thanks a lot for the work done to build and fix this model. I have read the whole thread, but it is not clear to me whether, by following the instructions in the documentation, I can get the performance of the paper, or whether there are still some aspects to be fixed. Can you please let me know?
Thanks a lot!
I had the same question as @turchmo.
By looking at the command used for training, I assume the coverage penalty was not used during training. Is that why the scores are lower (along with some of the differences already mentioned, like attention)? The coverage penalty was a major point in See et al., which is why I am interested to know whether the model was trained with it.
Training command posted:
python train.py -save_model logs/notag_sgd3 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 256 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9
I also think the pre-trained model for this is not available. The one posted by @pltrdy shows different scores than the ones reported. Link: https://github.com/pltrdy/shared_models/blob/master/20180117_onmtpy_sum.md
Below are the ROUGE scores reported by See et al. and by the current OpenNMT model.

                    ROUGE-1   ROUGE-2   ROUGE-L
see ptr-gen           36.44     15.66     33.42
see ptr-gen-cov       39.53     17.28     36.38
opennmt model         35.21     17.31     32.77
Hey guys, looking at recent pull requests and issues, it looks like a common interest of contributors (on top of NMT, obviously) is abstractive summarization.
Any suggestions on how to train a model that gets results close to recent papers on the CNN-Daily Mail dataset? Any additional preprocessing?
Thanks!