OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Abstractive Summarization Results #340

Closed mataney closed 6 years ago

mataney commented 7 years ago

Hey guys, looking at recent pull requests and issues, it looks like a common interest of contributors (on top of NMT, obviously) is abstractive summarization.

Any suggestions on how to train a model that will get results close to recent papers on the CNN-Daily Mail dataset? Any additional preprocessing?

Thanks!

srush commented 7 years ago

Hey, so we are getting close to these results, but still a little bit below.

Summarization Experiment Description

This document describes how to replicate summarization experiments on the CNNDM and Gigaword datasets using OpenNMT-py. In the following, we assume access to a tokenized form of the corpus split into train/valid/test sets.

An example article-title pair from Gigaword should look like this:

Input australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .

Output australian current account deficit narrows sharply

Preprocessing the data

Since we are using copy attention [1] in the model, we need to preprocess the dataset such that source and target are aligned and use the same dictionary. This is achieved with the options dynamic_dict and share_vocab. We additionally turn off truncation of the source so that inputs longer than 50 words are not truncated. For CNNDM we follow See et al. [2] and additionally truncate the source at 400 tokens and the target at 100.

command used:

(1) CNNDM

python preprocess.py -train_src data/cnndm/train.txt.src -train_tgt data/cnn-no-sent-tag/train.txt.tgt -valid_src data/cnndm/val.txt.src -valid_tgt data/cnn-no-sent-tag/val.txt.tgt -save_data data/cnn-no-sent-tag/cnndm -src_seq_length 10000 -tgt_seq_length 10000 -src_seq_length_trunc 400 -tgt_seq_length_trunc 100 -dynamic_dict -share_vocab

(2) Gigaword

python preprocess.py -train_src data/giga/train.article.txt -train_tgt data/giga/train.title.txt -valid_src data/giga/valid.article.txt -valid_tgt data/giga/valid.title.txt -save_data data/giga/giga -src_seq_length 10000 -dynamic_dict -share_vocab

Training

The training procedure described in this section for the most part follows the parameter choices and implementation of See et al. [2]. As mentioned above, we use copy attention as a mechanism for the model to decide whether to generate a new word or to copy it from the source (copy_attn). A notable difference to See's model is that we use the attention mechanism introduced by Bahdanau et al. [3] (global_attention mlp) instead of that by Luong et al. [4] (global_attention dot). Both options typically perform very similarly, with Luong attention often having a slight advantage. We use a 128-dimensional word embedding and a 512-dimensional one-layer LSTM. On the encoder side, we use a bidirectional LSTM (brnn), which means that the 512 dimensions are split into 256 per direction. We also share the word embeddings between encoder and decoder (share_embeddings). This option drastically reduces the number of parameters the model has to learn; however, we found only minimal impact on performance when training without it.

For the training procedure, we use SGD with an initial learning rate of 1 for a total of 16 epochs. In most cases, the lowest validation perplexity is achieved around epoch 10-12. We also use OpenNMT's default learning rate decay, which halves the learning rate after every epoch once the validation perplexity has increased (or after epoch 8). Alternative training procedures such as Adam with an initial learning rate of 0.001 converge faster than SGD, but achieve slightly worse results. We additionally set the maximum norm of the gradient to 2, and renormalize if the gradient norm exceeds this value.
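
As a rough sketch (this is not OpenNMT-py's actual implementation, just an approximation of the rule described above), the decay schedule amounts to something like:

def decayed_lr(lr, epoch, val_ppl, prev_val_ppl, start_decay_at=8, decay=0.5):
    # Halve the learning rate once validation perplexity has gone up,
    # or unconditionally once we pass the start_decay_at epoch.
    if epoch >= start_decay_at or (prev_val_ppl is not None and val_ppl > prev_val_ppl):
        return lr * decay
    return lr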

commands used:

(1) CNNDM

python train.py -save_model logs/notag_sgd3 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 256 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9

(2) Gigaword

python train.py -save_model logs/giga_sgd3_512 -data data/giga/giga -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9

Inference

During inference, we use beam search with a beam size of 10. We additionally use the replace_unk option, which replaces generated <UNK> tokens with the source token that received the highest attention. This acts as a safety net should the copy attention, which should learn to copy such words, fail.
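
As an illustration of the replace_unk idea (a sketch, not the actual translate.py code), assuming attn holds the attention weights for each decoding step:

def replace_unk(pred_tokens, src_tokens, attn):
    # attn[i][j]: attention weight on source token j at decoding step i
    out = []
    for i, tok in enumerate(pred_tokens):
        if tok == "<unk>":
            # copy the source token with the highest attention at this step
            j = max(range(len(src_tokens)), key=lambda k: attn[i][k])
            out.append(src_tokens[j])
        else:
            out.append(tok)
    return out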

commands used:

(1) CNNDM

python translate.py -gpu 2 -batch_size 1 -model logs/notag_try3_acc_49.29_ppl_14.62_e16.pt -src data/cnndm/test.txt.src -output sgd3_out.txt -beam_size 10 -replace_unk

(2) Gigaword

python translate.py -gpu 2 -batch_size 1 -model logs/giga_sgd3_512_acc_51.10_ppl_12.04_e16.pt -src data/giga/test.article.txt -output giga_sgd3.out.txt -beam_size 10 -replace_unk

Evaluation

CNNDM

To evaluate the ROUGE scores on CNNDM, we extended the pyrouge wrapper with additional evaluations such as the amount of repeated n-grams (typically found in models with copy attention), found here.

It can be run with the following command:

python baseline.py -s sgd3_out.txt -t ~/datasets/cnn-dailymail/sent-tagged/test.txt.tgt -m no_sent_tag -r

Note that the no_sent_tag option strips the tags around sentences: a sentence that previously was <s> w w w w . </s> becomes w w w w .
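
The repeated n-gram statistic mentioned above boils down to counting how many n-grams of a summary occur more than once. A rough sketch (not the actual rouge-baselines code) could look like:

from collections import Counter

def repeat_rate(tokens, n=3):
    # Fraction of n-grams in a summary that are part of a repeat.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)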

Gigaword

For evaluation of large test sets such as Gigaword, we use a parallel Python wrapper around ROUGE, found here.

command used: files2rouge giga_sgd3.out.txt test.title.txt --verbose

Running the commands above should yield the following scores:

ROUGE-1 (F): 0.352127
ROUGE-2 (F): 0.173109
ROUGE-3 (F): 0.098244
ROUGE-L (F): 0.327742
ROUGE-S4 (F): 0.155524

References

[1] Vinyals, O., Fortunato, M. and Jaitly, N., 2015. Pointer Networks. NIPS

[2] See, A., Liu, P.J. and Manning, C.D., 2017. Get To The Point: Summarization with Pointer-Generator Networks. ACL

[3] Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. ICLR

[4] Luong, M.T., Pham, H. and Manning, C.D., 2015. Effective approaches to attention-based neural machine translation. EMNLP

mataney commented 7 years ago

This is massive! Absolutely massive! Thank you very much.

By the way, I found using See's tokenized dataset (can be downloaded here) to work better.

What data do you pass to preprocess.py?

srush commented 7 years ago

Cool. Can you let us know what results you got? When you say "better", do you mean compared to what?

mataney commented 7 years ago

Hey, you wrote:

python baseline.py ...

Can't seem to find this file. Can you link me to the project?

I meant "better" by comparing the accuracy results of the original dataset to See's preprocessed runs.

sebastianGehrmann commented 7 years ago

The script we've been using is this one: https://github.com/falcondai/pyrouge/ This is a slightly modified version of the script described here: http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85

Thanks for the note about See's dataset. I will try and compare models with the different datasets

mataney commented 7 years ago

Still not sure where this baseline.py file is. I can run the script as in https://github.com/falcondai/pyrouge/ but I believe using baseline.py with its no_sent_tag option would be smarter.

pltrdy commented 7 years ago

Interesting discussion.

@srush your example shows the -brnn flag, which is now deprecated. You may want to replace it with -encoder_type brnn.

sebastianGehrmann commented 7 years ago

@mataney I linked the wrong repo - https://github.com/falcondai/rouge-baselines is what we use (it in turn uses pyrouge). One question: how do you use the preprocessed data you linked above? From my understanding, the download link has the individual documents instead of one large file. Do you just concatenate them? If so, do you have a script that I can use to reproduce your findings?

@pltrdy You're absolutely right, I copied the commands from a time before the brnn switch. We should definitely change that.

mataney commented 6 years ago

@sebastianGehrmann Cool, will run rouge-baselines on my model soon.

And in order to get just the big files I ran some of See's code (because I wanted to extract more than just the article and the abstract), so the following code is just the gist of See's preprocessing.

https://gist.github.com/mataney/67cfb05b0b84e88da3e0fe04fb80cfc8

So you can do something like this, or you can just concatenate them (the latter is shorter); a minimal concatenation sketch is below.
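
For reference, a minimal concatenation sketch (the directory layout and file names here are hypothetical, assuming each example has already been tokenized into a .src/.tgt file pair):

import glob, os

def concatenate(in_dir, ext, out_path):
    # Write one example per line, in the same (sorted) order for both sides.
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(os.path.join(in_dir, "*." + ext))):
            with open(path) as f:
                # collapse the per-file text onto a single line
                out.write(" ".join(f.read().split()) + "\n")

concatenate("train_tokenized", "src", "train.txt.src")
concatenate("train_tokenized", "tgt", "train.txt.tgt")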

sebastianGehrmann commented 6 years ago

Thanks, I'll check it out. To make sure we use the same exact files, could you upload yours and send me a download link via email? That'd be great! (gehrmann (at) seas.harvard.edu)

srush commented 6 years ago

Huh, this is the code I ran to make the dataset, it was forked from hers. https://github.com/OpenNMT/cnn-dailymail

I wonder if she changed anything...

srush commented 6 years ago

Oh I see, this is after the files are created. Huh, so the only thing I see that could be different is that she drops blank lines and does some unicode encoding. @mataney Could you run sdiff and confirm that? I don't see anything else in this gist, but I could be missing something.

mataney commented 6 years ago

@srush These files should be the same (sdiff won't work as I have more data about each article than just the article and abstract; I deleted this from my gist).

I can conclude this was a false alarm: I didn't know you were using See's preprocessing, but you are :) So our tokenization etc. is the same.

mataney commented 6 years ago

Another question: after training and translating I only get one-sentence summaries. This seems strange. @srush, are the translations you passed to baseline.py one-sentence summaries as well?

srush commented 6 years ago

Oh, shoot. I forgot to mention this. See uses </s> as her sentence-end token, which is unfortunately what we use in translate as well :( For our experiments we replaced hers with </t>. You can either do that, or change the end condition in translate to two repeated </s> tokens.
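
A minimal sketch of the first option (file names are placeholders), replacing See's sentence tags with <t>/</t> in the target files before preprocessing:

for path in ["train.txt.tgt", "val.txt.tgt", "test.txt.tgt"]:
    with open(path) as f:
        text = f.read()
    # swap the sentence tags so they no longer collide with the decoder's </s>
    text = text.replace("</s>", "</t>").replace("<s>", "<t>")
    with open(path + ".retagged", "w") as f:
        f.write(text)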

pltrdy commented 6 years ago

Why not just replace </s> with  . ?

BTW, it seems that there is no -m no_sent_tag option in the falcondai repo. I guess you are using a modified version?!

mataney commented 6 years ago

Hey guys, any feature ideas/fixes you can think of that would get us closer to See's results (seq2seq + attention + pointer, then coverage)?

srush commented 6 years ago

I think we are basically there. What scores are you getting?

srush commented 6 years ago

@sebastianGehrmann (when he gets back from vacation)

pltrdy commented 6 years ago

Using the hyperparameters you mentioned above, @srush, I get the following ROUGE scores on CNN/DM (after 16 epochs):

ROUGE-1 (F): 0.323996
ROUGE-2 (F): 0.140015
ROUGE-L (F): 0.244148

ROUGE-3 (F): 0.081449
ROUGE-S4 (F): 0.105728

mataney commented 6 years ago

Getting about the same, although I'm getting better results when the embedding and hidden sizes are 500. This is still rather different from what See reports: ROUGE-1, 2, L of 39.53, 17.28, 36.38 respectively.

(Obviously this is said without taking anything away from the brilliant work that has been done here! :smile: )

srush commented 6 years ago

Okay, let me post our model, we're doing a lot better. Think we need to update the docs.

(Although, worrisome that you are getting different results with the same args. I will check into that. )

srush commented 6 years ago

Okay, here are his args:

python train.py -save_model /scratch/cnndm/ada4 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -encoder_type brnn -epochs 16 -seed 777 -batch_size 16 -max_grad_norm 2 -share_embeddings -dropout 0. -gpuid 3 -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1

(See's RNN is split 512/256 which we don't support at the moment.)

And then during translation use the Wu-style coverage penalty with -alpha 0.9 -beta 0.25.
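
Putting the decoding options together (the checkpoint path is a placeholder), the translate command would look something like:

python translate.py -gpu 0 -batch_size 1 -model logs/ada4_<checkpoint>.pt -src data/cnndm/test.txt.src -output ada4_out.txt -beam_size 10 -replace_unk -alpha 0.9 -beta 0.25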

We're seeing a train ppl of 12.84, a val ppl of 11.98, and ROUGE-1/2 of 0.38 / 0.168.

mataney commented 6 years ago

Hey :) Tried to run this and it appears to be stuck around 4% accuracy. Just pulled from master, didn't change a thing.

So the only thing that might be different is the data that is being passed to preprocess.py. Anything special about it?

M.

srush commented 6 years ago

Okay, let's setup a conference call to figure this out. Sebastian is away at the moment, but he will be back shortly.

sebastianGehrmann commented 6 years ago

Hey @mataney I just tried to run with the latest code and within 150 batches I am getting over 8% accuracy. (Note: don't preprocess with sharding, since that breaks copy attention at the moment.) Can you give me some more information about your pytorch version, torchtext version, your preprocessing command, etc.? Also, can you send me a download link to your dataset by email (my email is my last name @seas.harvard.edu) so I can see if there is something wrong with yours?

pltrdy commented 6 years ago

With the last set of hyperparameters @srush provided (rnn_size 512, etc.) I get (after 16 epochs):

ROUGE-1 (F): 0.346241
ROUGE-2 (F): 0.143481
ROUGE-3 (F): 0.082514
ROUGE-L (F): 0.242341

I uploaded experiment details and the model: https://github.com/pltrdy/shared_models/blob/master/20180117_onmtpy_sum.md

sebastianGehrmann commented 6 years ago

Hey @pltrdy this seems closer, but not quite there yet. Without alpha and beta, the result should be around

R-1 36.2
R-2 16.2
R-L 33.4

and with your exact command, we got

R-1 38.0
R-2 16.8
R-L 35.0

I trained a model, decoded and got worse scores similar to yours. I am currently investigating why that is the case. Will update here later.

However, I noticed one thing - you are training the model with the <s>/</s> sentence tags in the target. I recommend that you remove them before training, as we have seen a major improvement in score when doing so. You can download a modified set here: https://goo.gl/ttSJsN

pltrdy commented 6 years ago

It would be interesting to run each other's models to see where the problem is.

Would you upload yours? (producing 36.2 / 16.2 / 33.4 without alpha/beta)

sebastianGehrmann commented 6 years ago

My model is running on 1990b3b30c1c1c6d2bc3f31b08bf789bbd14886d and a different torchtext version, so it makes little sense to compare across so many commits. @srush thinks the refactored preprocessing code might break the copy mechanism, so I am investigating that now. You can see the comparison here: https://github.com/OpenNMT/OpenNMT-py/compare/1990b3b30c1c1c6d2bc3f31b08bf789bbd14886d...master

I uploaded the generated text here, if you want to have a look: https://drive.google.com/file/d/11QXjAnY40j14tkG8Hz4q5Yg-IKH8_XSL/view?usp=sharing

pltrdy commented 6 years ago

@sebastianGehrmann from the archive you uploaded, I ran rouge scoring on ada4_out.txt and got:

ROUGE-1 (F): 0.359900
ROUGE-2 (F): 0.161221
ROUGE-3 (F): 0.095439
ROUGE-L (F): 0.263297

R-1 and R-2 are ok, but ROUGE-L is 10 pts below. I got those ROUGE scores by scoring ada4_out.txt against test.tgt.noeosbos.txt, that is, your test.txt.tgt without any <s> or </s>.

I really don't get what could be wrong...

sebastianGehrmann commented 6 years ago

Can you try using another pyrouge wrapper? I found that with your files2rouge, something is funky with ROUGE-L. (The 0.2-point difference in ROUGE-1 is, I think, because I decoded from a different epoch than my best, so nothing wrong there.)

ratishsp commented 6 years ago

Hi, do you follow See et al.'s approach of applying the coverage objective only after n epochs are completed?

pltrdy commented 6 years ago

@ratishsp nope. In OpenNMT-py, coverage is jointly trained from the first epoch.

ratishsp commented 6 years ago

Ok. Do you think your scores may improve further because of that?

ratishsp commented 6 years ago

Also, unlike what @srush mentioned in the Training section of the Summarization Experiment Description, I believe See et al. also use the mlp attention of Bahdanau et al. Please confirm.

pltrdy commented 6 years ago

@ratishsp probably, but I don't know how much this could improve things. My personal guess is that i) it makes the training faster and ii) it is marginally beneficial, which makes it an interesting -- but not mandatory -- trick.

The second comment is quite out of this issue's scope, but I do agree.

pltrdy commented 6 years ago

@sebastianGehrmann I figured out my problem. In fact, my ROUGE files were on a single line instead of one sentence per line; that's why ROUGE-L was incorrect.

mataney commented 6 years ago

Hey, after talking with @sebastianGehrmann and realising I had a problem with my dataset, I trained a new model with the hyperparameters mentioned above. Still not getting the 38 ROUGE-1 result you are getting.

Translating with -beam_size 10 -replace_unk -alpha 0.9 -beta 0.25 after training for 16 epochs. Using @pltrdy's files2rouge results after removing all bos and eos symbols:

1 ROUGE-1 Average_R: 0.35814 (95%-conf.int. 0.35570 - 0.36073)
1 ROUGE-1 Average_P: 0.35742 (95%-conf.int. 0.35475 - 0.36027)
1 ROUGE-1 Average_F: 0.34516 (95%-conf.int. 0.34292 - 0.34743)
---------------------------------------------
1 ROUGE-2 Average_R: 0.15130 (95%-conf.int. 0.14913 - 0.15364)
1 ROUGE-2 Average_P: 0.15203 (95%-conf.int. 0.14975 - 0.15461)
1 ROUGE-2 Average_F: 0.14613 (95%-conf.int. 0.14403 - 0.14834)
---------------------------------------------
1 ROUGE-L Average_R: 0.32948 (95%-conf.int. 0.32707 - 0.33196)
1 ROUGE-L Average_P: 0.32939 (95%-conf.int. 0.32670 - 0.33217)
1 ROUGE-L Average_F: 0.31780 (95%-conf.int. 0.31560 - 0.32008)

Same dataset, same code, same HP.

Do you have any advice?

pltrdy commented 6 years ago

@mataney my files2rouge was wrong. I just updated it (like 5 mins ago) so it correctly calculates ROUGE-L for multi-sentence sequences. It is basically a wrapper around pyrouge; my previous idea of parallel scoring was just bad (less accurate and not much faster).

sebastianGehrmann commented 6 years ago

@ratishsp We are using -global_attention mlp in our models, but no coverage attention at all. Instead, we use the Wu et al. coverage penalty during decoding.

It seems that somewhere in recent changes the copy attention broke, leading to decreased performance. I think it might have been partially fixed in e72c92bb49b6e3aea42f0b06962a0c814adbff95, but according to the numbers by @mataney there is still a drop of ~2 points across the board.

mataney commented 6 years ago

@pltrdy thanks, just pulled; ROUGE-L indeed looks better, edited my original comment. @sebastianGehrmann thanks, I think I will try later to run from the same commit as you did, just to be sure this is indeed the case. Will try to understand later what might have been broken.

ratishsp commented 6 years ago

@sebastianGehrmann Ok, I understand. A couple of queries then: did coverage attention not work well? The Wu et al. coverage penalty (from GNMT, I believe) may work well for translation, but should it work well for summarization, given that attention may not need to be applied uniformly over the source in summarization?

sebastianGehrmann commented 6 years ago

@mataney I am double-checking myself right now as well with different torchtext/pytorch/onmt versions.

@ratishsp We have not successfully managed to make coverage attention work yet, but it's an ongoing investigation. There are unfortunately several tricks needed which make the results heavily dependent on hyperparameters, random initialization etc. Training with coverage for the whole duration does not work. The coverage penalty definitely increases the results (see above for numbers), but not quite as much as the reported numbers with coverage attention by See et al.

LeenaShekhar commented 6 years ago

Thank you all for this work. I was wondering if by any chance you have the baseline (vanilla seq-2-seq with mlp attention) model for abstractive summarization available? Please let me know.

Update: Is the model mentioned here https://github.com/Iwontbecreative/OpenNMT-py an official model?

mataney commented 6 years ago

Small update:

"Thanks, I think I will try later to run from the same commit as you ran, just to be sure this is indeed the case. Will try to understand later what might have been broken."

Reran it and managed to get 38 ROUGE-1. I didn't manage to figure out what might have been broken.

pltrdy commented 6 years ago

@mataney could you share the experimental setup? (hparams, commit hash; the model itself could be interesting as well, to check whether translation was the problem, etc.)

thanks

mataney commented 6 years ago

The commit I'm using is December 5th commit: 132eb635029137b02a2d8fe1c39a9dd9d4fc37cf

I ran the preprocess script using dynamic_dict and share_vocab. My train run is like the one suggested above:

python train.py -gpuid ??? -data ??? -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -encoder_type brnn -epochs 20 -seed 777 -batch_size 16 -max_grad_norm 2 -share_embeddings -dropout 0. -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1

Although I'm running this for 20 epochs, I got the 38 ROUGE-1 F1 result after around 11 epochs; I have yet to check results for later epochs.

My versions: pytorch=0.2.0_2, torchtext 0.1.1.

I tried to compare the diff between this commit and master, but have yet to find anything interesting.

turchmo commented 6 years ago

Dear all, thanks a lot for the work done to build and fix this model. I have read the whole thread, but it is not clear to me whether, by following the instructions in the documentation, I can get the performance of the paper, or whether there are still some aspects to be fixed. Can you please let me know?

Thanks a lot!

LeenaShekhar commented 6 years ago

I had the same question as @turchmo.

By looking at the command used for training, I assume the coverage penalty was not used during training. Is that why the scores are lower (some of the differences, such as the attention type, are mentioned above)? Coverage was a major point in See et al., which is why I am interested to know whether the model was trained with it.

Training command posted: python train.py -save_model logs/notag_sgd3 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 256 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9

I also think the pre-trained model for this is not available. The one posted by @pltrdy shows different scores from the ones reported. Link: https://github.com/pltrdy/shared_models/blob/master/20180117_onmtpy_sum.md

Below are the ROUGE scores reported by See et al. and those of the current OpenNMT model.

| model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| See ptr-gen | 36.44 | 15.66 | 33.42 |
| See ptr-gen-cov | 39.53 | 17.28 | 36.38 |
| OpenNMT model | 35.21 | 17.31 | 32.77 |