abisee / pointer-generator

Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"

Failure to replicate results #16

Closed hate5six closed 7 years ago

hate5six commented 7 years ago

Has anyone been able to successfully replicate the model from the paper? I've been training for about two weeks (over 240k iterations) using the published parameters, along with training an additional ~3k iterations with coverage. Here is what my training loss looks like:

[image: training loss curve]

Unclear what caused the increase starting around iteration 180k, but even then the output was not looking great.

Here are some REF (gold) and DEC (system) summaries. As you can see, they are qualitatively bad. Unfortunately, at the moment, I can't figure out how to get pyrouge to run so I can't quantify the performance relative to the published results.

000000_reference.txt

REF: marseille prosecutor says so far no videos were used in the crash investigation '' despite media reports . journalists at bild and paris match are very confident '' the video clip is real , an editor says . andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .

DEC: robin 's comments are aware of any video footage , german paris match . he 's accused into the crash of germanwings flight 9525 flight . prosecutor : `` it is a very disturbing scene ''

000001_reference.txt

REF: membership gives the icc jurisdiction over alleged crimes committed in palestinian territories since last june . israel and the united states opposed the move , which could open the door to war crimes investigations against israelis .

DEC: palestinians signed icc 's founding rome statute of alleged crimes in palestinian territories . israel says `` in the occupied palestinian territory to immediately end and injustice , she says . it 's founding rome .

000002_reference.txt

REF: amnesty 's annual death penalty report catalogs encouraging signs , but setbacks in numbers of those sentenced to death . organization claims that governments around the world are using the threat of terrorism to advance executions . the number of executions worldwide has gone down by almost 22 % compared with 2013 , but death sentences up by 28 % .

DEC: it 's death sentences and executions 2014 '' is some we are closer to abolition , to advance executions '' number of deterrence , `` a number of countries are abolitionist '' amnesty says he would not be used for the death penalty .

000003_reference.txt

REF: amnesty international releases its annual review of the death penalty worldwide ; much of it makes for grim reading . salil shetty : countries that use executions to deal with problems are on the wrong side of history .

DEC: soldiers who a china agreed to tackle a surge in death sentences to death . jordan ended china 's public mass sentencing is part a china 's northwestern xinjiang region . a sharp spike in december 2006 , 2014 , 2014 .

000004_reference.txt

REF: museum : anne frank died earlier than previously believed . researchers re-examined archives and testimonies of survivors . anne and older sister margot frank are believed to have died in february 1945 .

DEC: bergen-belsen concentration camp is believed death on march 31 , anne frank says . four the jewish diarist concentration camp , margot , margot , violent , died at the age of 15 . `` i am no more than a skeleton camp , '' witnesses say .

If anyone has had success reproducing the published model I would love to hear how you did it. I'm stumped.

abisee commented 7 years ago

Hi @hate5six. Yes, looks like your system is generating summaries that are quite nonsensical. I'm not sure what would cause this.

  1. You say your 240k iterations took two weeks. Did you start by setting max_enc_steps and max_dec_steps to something small, then increase later? There's a note about it in the README. We found this necessary in order to do quicker iterations. That's how we did 230k iterations in the 3 days 4 hours reported in the paper. I think most of our training iterations for the pointer-generator model were with max_enc_steps=200 and max_dec_steps=50, then we increased to the full max_enc_steps=400 and max_dec_steps=100 only for a short training period near the end. I think you could go even lower -- with the baseline model we experimented with starting at max_enc_steps=50 and max_dec_steps=50 and that was even more efficient.

  2. Have you seen the note about chunking data? I'm not sure if this actually makes a difference to reproducibility, but we released the note in case it does.

  3. Are you aware that you can look at the full output on the test data (there's a link to it in the README)? This gives you a way to compare with the published results that's more comprehensive than just the ROUGE numbers and the handful of examples in the paper.
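The ramp-up described in point 1 amounts to a simple length curriculum. A minimal sketch of such a schedule is below; the iteration thresholds are illustrative assumptions, not the exact points used for the paper:

```python
# Illustrative curriculum for max_enc_steps / max_dec_steps.
# The iteration thresholds are hypothetical -- choose your own ramp-up points.
SCHEDULE = [
    (0,      50,  50),   # start small for fast iterations
    (50000,  200, 50),   # grow the encoder once loss flattens
    (220000, 400, 100),  # full lengths for a short period at the end
]

def steps_for_iteration(iteration):
    """Return (max_enc_steps, max_dec_steps) for a given training iteration."""
    enc, dec = SCHEDULE[0][1], SCHEDULE[0][2]
    for start, e, d in SCHEDULE:
        if iteration >= start:
            enc, dec = e, d
    return enc, dec
```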

hate5six commented 7 years ago

Hi @abisee,

I've tried training the model in several different ways on multiple GPU machines, starting with max enc/dec steps at different configurations like 50/50 and scaling them up. All attempts were failing to replicate the results, so at some point I decided to just let it train at 400/100 at the cost of running time. Part of that was due to not having a clear intuition about when to ramp up. I currently have a separate model training that started out at 50/50 and was then ramped up to 200/50, and is currently at 160k iterations. As you can see, it hasn't flattened out much:

[image: training loss curve for the 200/50 model]

I saw the update about chunking data earlier but I haven't incorporated it yet. I similarly lack the intuition for whether I'd need to restart training from scratch or could continue from my current point (240k iterations), and if I can resume, for how long.

Thanks for the link to the output. I'm looking at it and agree your published results look far more comprehensible. I understand that training models can be a delicate process, but I'm struggling to see what could potentially be happening.

abisee commented 7 years ago

@hate5six Thanks for the feedback. We want these results to be reproducible, which is why we released the code. Before release we needed to clean up the code significantly to make it comprehensible and usable for others, upgrade it to TensorFlow 1.0, etc. Hopefully these changes haven't affected the reproducibility of the experiments, which as you say are a delicate process anyway.

When I have some time I'll look into this and try to figure out what might be going on.

hate5six commented 7 years ago

@abisee I would appreciate that! Perhaps in the meantime I can fork my 200/50 model, try running with the chunked input, and let that run for a while before ramping up to 400/100.

makcbe commented 7 years ago

@abisee the following results are obtained:

INFO:tensorflow:REFERENCE SUMMARY: the 20th mls season begins this weekend . league has changed dramatically since its inception in 1996 . some question whether rules regarding salary caps and transfers need to change .

INFO:tensorflow:GENERATED SUMMARY: historic occasion was the first ever major league soccer match . the historic occasion was the first ever major league soccer match . it 's first of a new domestic tv and media rights deal with american soccer .

INFO:tensorflow:REFERENCE SUMMARY: bafetimbi gomis collapses within 10 minutes of kickoff at tottenham . but he reportedly left the pitch conscious and wearing an oxygen mask . gomis later said that he was `` feeling well '' . the incident came three years after fabrice muamba collapsed at white hart lane .

INFO:tensorflow:GENERATED SUMMARY: french striker bafetimbi gomis was feeling well '' after collapsing during tottenham . tottenham scored in the seventh minute of treatment . gomis was fine , '' with manager garry monk using the same condition .

While the first one looks off the mark, the second generated summary (about french striker bafetimbi) definitely makes sense (great work by you and the team, thanks a ton!). I am looking at the details.

Unfortunately, run_summarization stopped with "Error: model produced a word ID that isn't in the vocabulary". Any idea why this might occur and how to solve it? Note: this is against a copy of the output with coverage=true; that process is still running, though the coverage loss is increasing.

---------------- UPDATE ---------------

The above error is resolved by adding the lines below to data.py:

import tensorflow as tf
FLAGS = tf.app.flags.FLAGS

It perhaps skips the word ID and proceeds. @abisee - would increasing the vocab size to 200k and restarting the train/eval process be the right next step?

ioana-blue commented 7 years ago

Wondering if there is anyone out there who was able to reproduce the results. I'm currently using chunked data and I played with 200/50 and 400/100 for the sequence lengths. I ran into NaN issues several times (after 35k, around 70k, and around 145k steps). I restarted each time, most recently from a saved checkpoint from around step 110k. I find this rather painful and wonder why it happens.

Any insights? @abisee Thank you in advance for all the work!

hate5six commented 7 years ago

Hi @ioana-blue, I too have lost a lot of time trying to get things to run. Make sure you have the latest version of the code and data. There were recent changes to handle empty articles, which caused some of the NaN issues. You may also want to look at this: https://github.com/abisee/pointer-generator/issues/4

I found that helpful in getting around the NaN issues at 35k. I was able to train the model to over 200k iterations after making that change.
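One general way to limit the damage from NaN blowups like these is to checkpoint frequently and roll back whenever the loss goes non-finite. Below is a minimal sketch of that pattern; train_step, save_checkpoint, and restore_checkpoint are hypothetical stand-ins for the real TensorFlow training and Saver calls:

```python
import math

def run_with_nan_guard(train_step, save_checkpoint, restore_checkpoint,
                       num_iterations):
    """Roll back to the last good checkpoint whenever loss becomes NaN/inf.

    train_step() returns a float loss; save_checkpoint/restore_checkpoint
    are placeholders for the real checkpointing calls.
    """
    restores = 0
    i = 0
    while i < num_iterations:
        loss = train_step()
        if not math.isfinite(loss):
            restore_checkpoint()   # resume from the last good weights
            restores += 1
            continue               # retry from the restored state
        save_checkpoint()
        i += 1
    return restores
```

In practice you would checkpoint less often than every step and restore from the most recent checkpoint before the blowup, but the control flow is the same.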

ioana-blue commented 7 years ago

@hate5six thank you for your input. My code is up-to-date, I think. I added the modification that you suggested from #4. I had used it before with no success; I added it back, let's see. I'll update here when I have news.

ioana-blue commented 7 years ago

@hate5six Do you get reasonable results when decoding with the trained model for 200K iterations?

hate5six commented 7 years ago

@ioana-blue my model reached 200k last night and I'm just starting to investigate its performance now. I did run with coverage for an additional 3k iterations this morning, as advised by the authors. The generated summaries on the test set are readable; however, I'm also running the model on some of my own data and I'm seeing that the results are largely extractive. If anything, the model removes some excess, but it doesn't seem to be generating text. I have an issue/example at: https://github.com/abisee/pointer-generator/issues/21

abisee commented 7 years ago

@makcbe: that FLAGS error was fixed in this commit. Make sure you pull it and other fixes.

abisee commented 7 years ago

@makcbe:

@abisee - will it be the right next step to increase the vocab count to 200k and start the train, eval process again?

Increasing the vocabulary size from 50k to 200k will make everything run significantly slower. In our paper we found that (at least for the baseline model) a vocabulary of 150k didn't work better than a vocabulary of 50k. If I had to guess, I'd say increasing the vocabulary wouldn't help much.

As you found, the "Error: model produced a word ID that isn't in the vocabulary" message was printing because of a small bug regarding FLAGS. I think nothing was actually going wrong with the vocabulary and word IDs, so increasing the vocabulary was a red herring - sorry for the confusion.

EDIT: tagged the wrong person.

ioana-blue commented 7 years ago

@abisee I think your previous message was meant for @makcbe if I'm not mistaken :)

pltrdy commented 7 years ago

@hate5six did you manage to replicate the results?

I trained the model with all default parameters (so it's the pointer model, right?!) and added some coverage training. The decoding gives results far from what is published:

ROUGE-1:
rouge_1_f_score: 0.3142 with confidence interval (0.3122, 0.3161)
rouge_1_recall: 0.3368 with confidence interval (0.3345, 0.3391)
rouge_1_precision: 0.3218 with confidence interval (0.3193, 0.3241)

ROUGE-2:
rouge_2_f_score: 0.1240 with confidence interval (0.1222, 0.1255)
rouge_2_recall: 0.1339 with confidence interval (0.1320, 0.1357)
rouge_2_precision: 0.1267 with confidence interval (0.1249, 0.1283)

ROUGE-l:
rouge_l_f_score: 0.2855 with confidence interval (0.2837, 0.2874)
rouge_l_recall: 0.3060 with confidence interval (0.3036, 0.3082)
rouge_l_precision: 0.2927 with confidence interval (0.2903, 0.2948)
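
For intuition about the numbers above: ROUGE-1 recall, precision, and F1 are unigram-overlap measures between the generated and reference summaries. Here is a rough sanity-check sketch; it is not the official ROUGE-1.5.5 Perl scorer, which additionally applies stemming, sentence splitting, and bootstrap confidence intervals:

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Clipped unigram overlap between a candidate and a reference summary.

    Returns (precision, recall, f1). Tokenization is naive whitespace
    splitting, so scores will differ from the official Perl toolkit.
    """
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```
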

hate5six commented 7 years ago

@pltrdy I did, qualitatively at least. The results I was seeing looked similar to what's in the paper/blog post and the final loss with/without coverage agreed, too. I had trouble installing pyrouge so my analysis stopped short of comparing the actual ROUGE scores, unfortunately.

pltrdy commented 7 years ago

@hate5six could you then provide your decoded files? I'd be interested in scoring them myself.

hate5six commented 7 years ago

@pltrdy it looks like I only ran it on 1500 validation articles. I'd be happy to share them, but it also wouldn't make for a complete comparison against the published results. Let me know. I could process the whole validation set, but I don't have an ETA on that at the moment given my current workload and resource availability.

pltrdy commented 7 years ago

Or maybe you could just upload your best checkpoint files? I would then run decoding and scoring (and share the results with you if you want).

abhishekraok commented 7 years ago

@pltrdy I trained according to the number of steps given in the paper and was able to get a ROUGE-1 F1 score of 37.34 and a ROUGE-2 F1 score of 16.92, which is 2.19/0.36 points below the paper. I can share the checkpoint files if needed.

I used a Java ROUGE calculator. I couldn't figure out how to use pyrouge. Do you have the files/instructions needed to use pyrouge? Thanks

pltrdy commented 7 years ago

@abhishekraok interesting, could you tell me how you trained your model? (How many training steps? How many coverage training steps?)

You can find pyrouge instructions here: https://pypi.python.org/pypi/pyrouge/0.1.3. You basically need to download ROUGE and then tell pyrouge where it is located using: pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5/directory.

I'm personally using my own files2rouge tool.

abhishekraok commented 7 years ago

Actually, I managed to figure out how to get the Perl ROUGE files. I have added a PR for that to be added to the README. Here are the numbers:

ROUGE-1:
rouge_1_f_score: 0.3834 with confidence interval (0.3812, 0.3856)
rouge_1_recall: 0.4362 with confidence interval (0.4335, 0.4390)
rouge_1_precision: 0.3640 with confidence interval (0.3616, 0.3665)

ROUGE-2:
rouge_2_f_score: 0.1668 with confidence interval (0.1647, 0.1689)
rouge_2_recall: 0.1900 with confidence interval (0.1876, 0.1923)
rouge_2_precision: 0.1587 with confidence interval (0.1566, 0.1608)

ROUGE-l:
rouge_l_f_score: 0.3500 with confidence interval (0.3478, 0.3521)
rouge_l_recall: 0.3978 with confidence interval (0.3953, 0.4003)
rouge_l_precision: 0.3325 with confidence interval (0.3302, 0.3348)

abhishekraok commented 7 years ago

@pltrdy I trained for 230k steps without coverage and 3k steps with coverage, as mentioned in the paper. I have modified the code in my fork to run for an exact number of steps. Thanks for the info on ROUGE.
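The fixed-step schedule described here (230k steps without coverage, then 3k with coverage enabled) can be sketched as follows; train_step is a hypothetical placeholder for one optimizer step of the real model:

```python
def train_two_phase(train_step, pretrain_steps=230000, coverage_steps=3000):
    """Run the paper's schedule: pretrain without the coverage loss,
    then fine-tune briefly with coverage enabled.

    train_step(coverage) is a stand-in for one training step; in the real
    code this corresponds to running with the --coverage flag off, then on.
    """
    for _ in range(pretrain_steps):
        train_step(coverage=False)
    for _ in range(coverage_steps):
        train_step(coverage=True)
```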

abisee commented 7 years ago

You may be interested in the pretrained model that is now available -- see README.

jasonwbw commented 6 years ago

@abhishekraok Thank you for sharing your modifications, could you share a checkpoint for us to evaluate the model? Thank you very much.

LeenaShekhar commented 6 years ago

@abhishekraok, @ioana-blue and @hate5six Did any of you train the baseline model for this? Due to a TF version issue, the author could not publish a pre-trained model for the baseline architecture, so it would be really great if any of you did it and are willing to share it. Thank you.

abhishekraok commented 6 years ago

@jasonwbw @LeenaShekhar I have uploaded the files from which I got the metrics here https://1drv.ms/f/s!AizkA_PooBXSlIJO82kT0PZ_IboUrA It uses TF 1.2.0

LeenaShekhar commented 6 years ago

@abhishekraok Sorry for such a late reply. Just to be sure, are those files for the baseline model? Thank you so much for sharing them.

abhishekraok commented 6 years ago

Sorry, this is not the baseline model. This is the one with the pointer generator and coverage. I tried to get the best metrics possible.