Hi @hate5six. Yes, looks like your system is generating summaries that are quite nonsensical. I'm not sure what would cause this.
You say your 240k iterations took two weeks. Did you start by setting `max_enc_steps` and `max_dec_steps` to something small, then increase them later? There's a note about it in the README. We found this necessary in order to do quicker iterations; that's how we did 230k iterations in the 3 days 4 hours reported in the paper. I think most of our training iterations for the pointer-generator model were with `max_enc_steps=200` and `max_dec_steps=50`, then we increased to the full `max_enc_steps=400` and `max_dec_steps=100` only for a short training period near the end. I think you could go even lower -- with the baseline model we experimented with starting at `max_enc_steps=50` and `max_dec_steps=50`, and that was even more efficient.
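Concretely, the schedule above might be scripted roughly like the sketch below (a minimal sketch, assuming the flag names from the README / run_summarization.py; the paths are placeholders). Training resumes from the latest checkpoint under `log_root`/`exp_name`, so you can stop the job and relaunch it with larger lengths for the last stretch.

```python
import subprocess

# Minimal sketch of the ramp-up schedule. Flag names are assumed from the README /
# run_summarization.py; the data/vocab/log paths below are placeholders.
def launch_training(max_enc_steps, max_dec_steps):
    subprocess.run([
        "python", "run_summarization.py",
        "--mode=train",
        "--data_path=/path/to/chunked/train_*",
        "--vocab_path=/path/to/vocab",
        "--log_root=/path/to/log",
        "--exp_name=myexperiment",
        "--max_enc_steps=%d" % max_enc_steps,
        "--max_dec_steps=%d" % max_dec_steps,
    ])

launch_training(200, 50)    # most of training at the shorter lengths
# ...stop the job near the end of training, then relaunch at the full lengths
# for a short period; it resumes from the checkpoint in the same log directory:
# launch_training(400, 100)
```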
Have you seen the note about chunking data? I'm not sure if this actually makes a difference to reproducibility, but we released the note in case it does.
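For reference, the chunking itself is simple: the .bin files are just length-prefixed serialized tf.Example strings, so splitting them into chunks looks roughly like the sketch below (the actual script lives in the data-processing repo; the chunk size and paths here are assumptions/placeholders).

```python
import struct

CHUNK_SIZE = 1000  # examples per chunk (assumed; check the chunking script for the real value)

def chunk_file(in_file, out_prefix):
    """Split a .bin file of length-prefixed serialized tf.Examples into smaller chunks."""
    reader = open(in_file, "rb")
    writer, chunk_idx, count = None, 0, 0
    while True:
        len_bytes = reader.read(8)
        if not len_bytes:
            break  # end of input file
        str_len = struct.unpack("q", len_bytes)[0]
        example_bytes = reader.read(str_len)
        if writer is None or count == CHUNK_SIZE:
            if writer is not None:
                writer.close()
            writer = open("%s_%03d.bin" % (out_prefix, chunk_idx), "wb")
            chunk_idx += 1
            count = 0
        writer.write(len_bytes)       # keep the same length-prefixed format
        writer.write(example_bytes)
        count += 1
    if writer is not None:
        writer.close()
    reader.close()

chunk_file("finished_files/train.bin", "finished_files/chunked/train")  # placeholder paths
```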
Are you aware that you can look at the full output on the test data (there's a link to it in the README)? This gives you a way to compare with the published results that's more comprehensive than just the ROUGE numbers / handful of examples in the paper.
Hi @abisee,
I've tried training the model in several different ways on multiple GPU machines -- starting with max enc/dec steps at different configurations like 50/50 and scaling them up. All attempts were failing to replicate the results, so at some point I decided to just let it train at 400/100 at the cost of running time. Part of that was due to not having a clear intuition about when to ramp up. I currently have a separate model training that started out at 50/50 and was then ramped up to 200/50, and is currently at 160k iterations. As you can see, it hasn't seemed to flatten out much:
I saw the update about chunking data earlier but I haven't incorporated it yet. I similarly don't have a clear sense of whether I would need to restart training from scratch or could continue from my current point (240k iterations) -- and if I can resume, for how long.
Thanks for the link to the output. I'm looking at it and agree your published results look far more comprehensible. I understand that training these models can be a delicate process, but I'm struggling to see what could be happening.
@hate5six Thanks for the feedback. We want these results to be reproducible, which is why we released the code. Before release we needed to significantly clean up the code to make it comprehensible and usable for others, upgrade to TensorFlow 1.0, etc. Hopefully these changes haven't affected the reproducibility of the experiments, which as you say are a delicate process anyway.
When I have some time I'll look into this and try to figure out what might be going on.
@abisee I would appreciate that! Perhaps in the meantime I can fork my 200/50 model, try running with the chunked input, and let that run for a while before ramping up to 400/100.
@abisee the following results are obtained:

INFO:tensorflow:REFERENCE SUMMARY: the 20th mls season begins this weekend . league has changed dramatically since its inception in 1996 . some question whether rules regarding salary caps and transfers need to change .

INFO:tensorflow:GENERATED SUMMARY: historic occasion was the first ever major league soccer match . the historic occasion was the first ever major league soccer match . it 's first of a new domestic tv and media rights deal with american soccer .

INFO:tensorflow:REFERENCE SUMMARY: bafetimbi gomis collapses within 10 minutes of kickoff at tottenham . but he reportedly left the pitch conscious and wearing an oxygen mask . gomis later said that he was `` feeling well '' . the incident came three years after fabrice muamba collapsed at white hart lane .

INFO:tensorflow:GENERATED SUMMARY: french striker bafetimbi gomis was feeling well '' after collapsing during tottenham . tottenham scored in the seventh minute of treatment . gomis was fine , '' with manager garry monk using the same condition .
While the first one looks off the mark, the second generated summary (about French striker Bafetimbi Gomis) definitely makes sense (great work by you and the team -- thanks a ton). I am looking at the details.
Unfortunately, run_summarization stopped with "Error: model produced a word ID that isn't in the vocabulary". Any idea why this might occur and how to solve it? Note: this is against a copy of the output with coverage=true; that process is still running, though the coverage loss is increasing.
---------------- UPDATE ---------------
The above error is resolved by adding the below lines to data.py
import tensorflow as tf
FLAGS = tf.app.flags.FLAGS
It perhaps skips the word id and proceeds. @abisee - will it be the right next step to increase the vocab count to 200k and start the train, eval process again?
Wondering if there is anyone out there who was able to reproduce the results. I'm currently using chunked data and I played with 200/50 and 400/100 for the sequence lengths. I ran into NaN issues several times (after 35K, around 70K, and around 145K steps). I restarted each time, the last time from a saved checkpoint from around step 110K. I find this rather painful and I'm wondering why it happens.
Any insights? @abisee Thank you in advance for all the work!
Hi @ioana-blue, I too have lost a lot of time trying to get things to run. Make sure you have the latest version of the code and data; there were changes recently to handle empty articles, which caused some of the NaN issues. You may also want to look at this: https://github.com/abisee/pointer-generator/issues/4
I found that helpful in getting around the NaN issues at 35k. I was able to train the model to over 200k iterations after making that change.
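In the meantime, a generic guard in the training loop can at least keep a NaN from killing a long run; something like the sketch below (just an illustration, not the repo's actual fix -- it assumes the model's run_train_step(sess, batch) returns a dict with a 'loss' entry).

```python
import numpy as np
import tensorflow as tf

# Generic NaN guard (not the repo's actual fix): if the training loss stops being
# finite, restore the last good checkpoint instead of letting the weights diverge.
# Assumes run_train_step(sess, batch) returns a dict containing a 'loss' entry.
def train_step_with_guard(model, sess, batch, saver, ckpt_dir):
    results = model.run_train_step(sess, batch)
    loss = results['loss']
    if not np.isfinite(loss):
        latest = tf.train.latest_checkpoint(ckpt_dir)
        tf.logging.warning('Loss is not finite; restoring from %s', latest)
        saver.restore(sess, latest)
    return loss
```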
@hate5six thank you for your input. My code is up to date, I think. I added the modification that you suggested from #4. I had used it before with no success; I've added it back, let's see. I'll update here when I have news.
@hate5six Do you get reasonable results when decoding with the trained model for 200K iterations?
@ioana-blue my model reached it last night and I'm just starting to investigate its performance now. I did run with coverage for an additional 3k iterations this morning, as advised by the authors. The generated summaries on the test set are readable; however, I'm also running the model on some of my own data and I'm seeing that the results are largely extractive. If anything the model removes some excess, but it doesn't seem to be generating new text. I have an issue/example at: https://github.com/abisee/pointer-generator/issues/21
@makcbe: that `FLAGS` error was fixed in this commit. Make sure you pull it and other fixes.
@makcbe:
@abisee - will it be the right next step to increase the vocab count to 200k and start the train, eval process again?
Increasing the vocabulary size from 50k up to 200k will make everything run significantly slower. In our paper we found that (at least for the baseline model) a vocabulary of 150k didn't work better than a vocabulary of 50k. If I had to guess, I don't think increasing the vocabulary would help much.
As you found, that `Error: model produced a word ID that isn't in the vocabulary` message was printing because there was a small bug regarding `FLAGS`. I think nothing was actually going wrong with the vocabulary and word IDs. So increasing the vocabulary was a red herring -- sorry for the confusion.
EDIT: tagged the wrong person.
@abisee I think your previous message was meant for @makcbe if I'm not mistaken :)
@hate5six did you manage to replicate the results?
I trained the model with all default parameters (then it's the pointer model, right?!) and added some coverage training. The decoding gives results far from what is published:
ROUGE-1:
rouge_1_f_score: 0.3142 with confidence interval (0.3122, 0.3161)
rouge_1_recall: 0.3368 with confidence interval (0.3345, 0.3391)
rouge_1_precision: 0.3218 with confidence interval (0.3193, 0.3241)
ROUGE-2:
rouge_2_f_score: 0.1240 with confidence interval (0.1222, 0.1255)
rouge_2_recall: 0.1339 with confidence interval (0.1320, 0.1357)
rouge_2_precision: 0.1267 with confidence interval (0.1249, 0.1283)
ROUGE-l:
rouge_l_f_score: 0.2855 with confidence interval (0.2837, 0.2874)
rouge_l_recall: 0.3060 with confidence interval (0.3036, 0.3082)
rouge_l_precision: 0.2927 with confidence interval (0.2903, 0.2948)
@pltrdy I did, qualitatively at least. The results I was seeing looked similar to what's in the paper/blog post and the final loss with/without coverage agreed, too. I had trouble installing pyrouge so my analysis stopped short of comparing the actual ROUGE scores, unfortunately.
@hate5six could you then provide your decoded files? I would be interested in scoring them myself.
@pltrdy it looks like I only ran it on 1500 validation articles. I'd be happy to share them but it also wouldn't make for a complete comparison against the published results. Let me know. I could process the whole validation set but I don't have an ETA on that at the moment given my current workload/resource availability.
Or maybe you could just upload your best checkpoint files? I would then run decoding & scoring (and share the results with you if you want).
@pltrdy I trained according to the number of steps given in the paper and was able to get Rouge-1 F1 score = 37.34 | Rouge-2 F1 score = 16.92, which is 2.19/0.36 points lower than the paper. I can share the checkpoint files if needed.
I used a Java ROUGE calculator. I couldn't figure out how to use pyrouge. Do you have the files/instructions needed to use pyrouge? Thanks
@abhishekraok interesting, could you tell me how you trained your model? (How many training steps? How many coverage training steps?)
You can find pyrouge instructions here: https://pypi.python.org/pypi/pyrouge/0.1.3. You basically need to download ROUGE and tell pyrouge where it is located using `pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5/directory`.
I'm personally using my own files2rouge tool.
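Once ROUGE-1.5.5 is installed and registered, scoring the decode output with pyrouge looks roughly like the sketch below (the directory names and filename patterns assume the decode step's decoded/reference output layout; adjust them to match your files).

```python
from pyrouge import Rouge155

# Minimal pyrouge sketch, assuming ROUGE-1.5.5 is installed and registered via
# pyrouge_set_rouge_path. Directories and filename patterns are assumptions based
# on the decode step's decoded/reference layout; change them to match your output.
r = Rouge155()
r.system_dir = '/path/to/decode_dir/decoded'      # generated summaries
r.model_dir = '/path/to/decode_dir/reference'     # gold summaries
r.system_filename_pattern = r'(\d+)_decoded.txt'
r.model_filename_pattern = '#ID#_reference.txt'

output = r.convert_and_evaluate()
print(output)
scores = r.output_to_dict(output)
print(scores['rouge_1_f_score'], scores['rouge_2_f_score'], scores['rouge_l_f_score'])
```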
Actually I managed to figure out how to get the Perl ROUGE files. I have added a PR for that to be added to the README. Here are the numbers:
ROUGE-1:
rouge_1_f_score: 0.3834 with confidence interval (0.3812, 0.3856)
rouge_1_recall: 0.4362 with confidence interval (0.4335, 0.4390)
rouge_1_precision: 0.3640 with confidence interval (0.3616, 0.3665)
ROUGE-2:
rouge_2_f_score: 0.1668 with confidence interval (0.1647, 0.1689)
rouge_2_recall: 0.1900 with confidence interval (0.1876, 0.1923)
rouge_2_precision: 0.1587 with confidence interval (0.1566, 0.1608)
ROUGE-l:
rouge_l_f_score: 0.3500 with confidence interval (0.3478, 0.3521)
rouge_l_recall: 0.3978 with confidence interval (0.3953, 0.4003)
rouge_l_precision: 0.3325 with confidence interval (0.3302, 0.3348)
@pltrdy I trained for 230K steps without coverage and 3k steps with coverage, as mentioned in the paper. I have modified the code in my fork to run for an exact number of steps. Thanks for the info on ROUGE.
You may be interested in the pretrained model that is now available -- see README.
@abhishekraok Thank you for sharing your modifications, could you share a checkpoint for us to evaluate the model? Thank you very much.
@abhishekraok, @ioana-blue and @hate5six Did any of you train the baseline model for this? Due to a TF version issue the author could not publish the pre-trained model for the baseline architecture, so it would be really great if any of you did and are willing to share it. Thank you.
@jasonwbw @LeenaShekhar I have uploaded the files from which I got the metrics here https://1drv.ms/f/s!AizkA_PooBXSlIJO82kT0PZ_IboUrA It uses TF 1.2.0
@abhishekraok Sorry for such a late reply. Just to be sure, are those files for the baseline model? Thank you so much for sharing them.
Sorry, this is not the baseline model. This is one with the pointer generator and coverage. I tried to get the best metrics possible.
Has anyone been able to successfully replicate the model from the paper? I've been training for about two weeks (over 240k iterations) using the published parameters, along with training an additional ~3k iterations with coverage. Here is what my training loss looks like:
It's unclear what caused the increase starting around iteration 180k, but even then the output was not looking great.
Here are some REF (gold) and DEC (system) summaries. As you can see, they are qualitatively bad. Unfortunately, at the moment, I can't figure out how to get pyrouge to run so I can't quantify the performance relative to the published results.
If anyone has had success reproducing the published model I would love to hear how you did it. I'm stumped.