CoderPat / structured-neural-summarization

A repository with the code for the paper with the same title
MIT License
74 stars 26 forks

CNNDailymail data: Predicted summaries are lists of single words and lead to a ROUGE score of zero #21

Open shandou opened 5 years ago

shandou commented 5 years ago

Thank you very much for providing the latest updates to the repo. I am still having trouble training the model on a small subset of the CNNDailymail data. During inference, the model keeps producing predictions that are lists of single words. More details below:

  1. How I ran the code:

    train_and_eval.py --infer_source_file /home/shan/datasets/NLP/dev_CNNDM_sequenceGGNN/jsonl/test/inputs.jsonl.gz --infer_predictions_file /home/shan/datasets/NLP/dev_CNNDM_sequenceGGNN/jsonl/test/predictions.jsonl
  2. The spurious single-word predictions:

    Validation predictions...
    [['at'], ['at'], ['at'], ['at'], ['5.3million'], ['5.3million'], ['5.3million'], ['at'], ['at'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['at'], ['5.3million'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['5.3million'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['5.3million'], ['5.3million'], ['at'], ['rehahn'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['rehahn'], ['at'],
    (rest of stdout omitted)
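Flattening predictions like these and counting tokens makes the collapse obvious; a throwaway snippet (plain Python, using a short excerpt of the output above):

```python
from collections import Counter

# Excerpt of the degenerate predictions printed above.
predictions = [
    ["at"], ["at"], ["5.3million"], ["at"], ["rehahn"],
    ["at"], ["5.3million"], ["at"], ["at"], ["at"],
]

# Every prediction is a single token, and the vocabulary has collapsed
# to a handful of types -- a common sign of a broken decoder or checkpoint.
token_counts = Counter(tok for pred in predictions for tok in pred)
print(token_counts.most_common(3))
```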

Meanwhile, the target summaries appear to be parsed properly. For example:

Targets...
[['Lord', 'Mervyn', 'Davies,', '62,', 'was', 'at', 'a', 'Royal', 'Academy', 'of', 'Arts', 'party', 'last', 'night.', 'Singer', 'Usher', 'had', 'been', 'speaking', 'to', 'group', 'of', 'young', 'people', 'at', 'charity', 'event.', 'Labour', 'peer', 'showed', 'off', 'his', 'fancy', 'footwork', 'on', 'the', 'dance', 'floor.', 'Usher', 'will', 'finish', 'his', 'tour', 'with', 'a', 'concert', 'at', 'the', 'O2', 'tonight', '.'], ['Craig', 'MacLean,', '22,', 'was', 'on', 'flight', 'to', 'Abu', 'Dhabi', 'when', 'staff', 'called', 'for', 'doctor.', 'The', 'medical', 'student', 'stepped', 'in', 'to', 'help', 'when', 'man', 'suffered', 'a', 'cardiac', 'arrest.', 'Dundee', 'University', 'student', 'started', 'trying', 'to', 'revive', 'the', 'passenger', 'at', '36,000ft.', 'KLM', 'flight', 'from', 'Scotland', 'diverted', 'to', 'Turkey', 'and', 'man', 'received', 'medical', 'care.'],
(and so on)
  3. The error messages: the ROUGE score ends up being zero, and training soon errors out:
    
    eval loss: 8.41, eval rouge: 0.00
    early stopping triggered...
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    ~/workspace/GGNN_text_summarizer/train_and_eval.py in <module>
    625 
    626 if __name__ == "__main__":
    --> 627     main()

    ~/workspace/GGNN_text_summarizer/train_and_eval.py in main()
        212
        213 if args.infer_source_file is not None:
    --> 214     infer(model, args)
        215
        216

    ~/workspace/GGNN_text_summarizer/train_and_eval.py in infer(model, args)
        487 # saver = tf.train.Saver(max_to_keep=100)
        488 saver = tf.train.Saver(max_to_keep=1)
    --> 489 saver.restore(session, os.path.join(args.checkpoint_dir, "best.ckpt"))
        490
        491 # build eval graph, loss and prediction ops

    ~/software/anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow/python/training/saver.py in restore(self, sess, save_path)
       1266     if not checkpoint_management.checkpoint_exists(compat.as_text(save_path)):
       1267       raise ValueError("The passed save_path is not a valid checkpoint: "
    -> 1268                        + compat.as_text(save_path))
       1269
       1270     logging.info("Restoring parameters from %s", compat.as_text(save_path))

    ValueError: The passed save_path is not a valid checkpoint: cnndailymail_summarizer/best.ckpt
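The ValueError at the end is separate from the prediction quality: `saver.restore` raises it when no files matching `best.ckpt.*` exist under `checkpoint_dir` (a TF1 checkpoint named `best.ckpt` is stored on disk as `best.ckpt.index` plus one or more `best.ckpt.data-*` shards). A stdlib-only pre-flight check, as a hypothetical helper mirroring what `checkpoint_management.checkpoint_exists` verifies:

```python
import glob
import os

def checkpoint_files_exist(checkpoint_dir: str, prefix: str = "best.ckpt") -> bool:
    """Return True if TF1-style checkpoint files for `prefix` exist.

    A TF1 checkpoint named `best.ckpt` is materialised on disk as
    `best.ckpt.index` plus one or more `best.ckpt.data-*` shards.
    """
    base = os.path.join(checkpoint_dir, prefix)
    return bool(glob.glob(base + ".index")) and bool(glob.glob(base + ".data-*"))

# Example: fail fast with a clearer message than the ValueError above.
if not checkpoint_files_exist("cnndailymail_summarizer"):
    print("No trained checkpoint found -- did training ever save best.ckpt?")
```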



Would you mind providing some insights on what might have caused this issue? Thanks!
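As an aside, the zero ROUGE score follows directly from the shape of the predictions: ROUGE-2 measures bigram overlap, and a single-token prediction contains no bigrams at all, so its ROUGE-2 is identically zero regardless of the target. A minimal bigram-recall illustration (plain Python, not the repo's actual ROUGE implementation):

```python
def bigrams(tokens):
    """All consecutive token pairs; empty for sequences shorter than 2."""
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2_recall(prediction, target):
    """Fraction of the target's bigrams that also appear in the prediction."""
    target_bigrams = bigrams(target)
    if not target_bigrams:
        return 0.0
    predicted = set(bigrams(prediction))
    hits = sum(1 for bg in target_bigrams if bg in predicted)
    return hits / len(target_bigrams)

target = ["singer", "usher", "had", "been", "speaking", "to", "young", "people"]
print(rouge2_recall(["at"], target))  # single-token prediction -> 0.0
print(rouge2_recall(["usher", "had", "been", "dancing"], target))
```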
CoderPat commented 5 years ago

Hmm, that's weird. Did you run into similar problems @ioana-blue ?

ioana-blue commented 5 years ago

No. Right now I get coherent predictions that are not well aligned with the targets, so I get about 0.14 ROUGE-2. However, I'm trying this on a tiny dataset (about 10k training samples).

shandou commented 5 years ago

Thank you very much for getting back to me 😃 It could be that both my training data and the number of training iterations are too small (as I want to make sure that I am using the entire pipeline properly). @ioana-blue @CoderPat In your opinion:

  1. If you have to provide a ballpark estimate: What is the minimum viable training data size for the NLP task?
  2. How many training steps would you need to start to get sensible predictions?
  3. Are the default hyperparameters in train_and_eval a good starting point for the NLP task?

I also find the CoreNLP annotation step computationally intensive. I wonder whether you could provide some insight into why that is, and whether structural annotation in general is expected to be this expensive.
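(For what it's worth, a large share of a CoreNLP pipeline's cost usually comes from the parsing annotators, which are far more expensive than tokenization, and from JVM startup when the pipeline is launched per document. A generic mitigation, independent of this repo, is to keep a long-running annotator warm and feed it documents concurrently; a sketch where the `annotate` stub stands in for a real CoreNLP call:)

```python
from concurrent.futures import ThreadPoolExecutor

def annotate(doc: str) -> dict:
    # Placeholder for a real request to a long-running CoreNLP server;
    # here we just tokenize on whitespace so the sketch is self-contained.
    return {"doc": doc, "tokens": doc.split()}

def annotate_corpus(docs, max_workers=4):
    """Annotate documents concurrently against a warm annotator."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(annotate, docs))

results = annotate_corpus(["a small test document", "another one"])
print(len(results))  # 2
```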

Thanks a lot!!

CoderPat commented 5 years ago

Even though a good amount of data is necessary for good results, you shouldn't be seeing the same words over and over. The defaults were tuned for the full dataset, but one thing I've noticed about graph models is that they are much more sensitive to hyperparameter choices. I should have some free time in the upcoming weeks, so I'll try to retrain the model on the full CNNDailymail dataset to see if I catch any more bugs, and upload a checkpoint.

shandou commented 5 years ago

Great!! I'll also tinker more in parallel and check with you again later. Thanks a lot! :)

CoderPat commented 5 years ago

@shandou I think I found the problem: it's some weird issue with TensorFlow not checkpointing some variables (might be caused by a new version of TensorFlow). I assume @ioana-blue doesn't hit it since she does inference in the same run as training. I'll try to investigate and fix it soon.
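When variables go missing from a checkpoint like this, a set difference between the graph's variable names and the names stored in the checkpoint pinpoints the culprits. In TF1 the two lists would come from `tf.global_variables()` and `tf.train.list_variables(ckpt_path)`; the comparison itself is plain Python (the variable names below are hypothetical, for illustration only):

```python
def missing_from_checkpoint(graph_var_names, ckpt_var_names):
    """Names present in the graph but absent from the checkpoint."""
    return sorted(set(graph_var_names) - set(ckpt_var_names))

# Hypothetical variable names, not taken from this repo's actual graph.
graph_vars = ["encoder/ggnn/w_0", "decoder/attention/v", "decoder/output/b"]
ckpt_vars = ["encoder/ggnn/w_0", "decoder/output/b"]
print(missing_from_checkpoint(graph_vars, ckpt_vars))  # ['decoder/attention/v']
```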

shandou commented 5 years ago

Many thanks for looking into this! I haven't been able to spend much time on the code this week 😞 but plan to come back to it this weekend. Please keep me posted!

ioana-blue commented 5 years ago

That's right, so far I've been doing inference in the same run. But at some point it would be nice to do inference after loading a checkpoint. In fact, I did run it like this, but only for debugging (without looking at overall accuracy).

shellycsy commented 5 years ago

Hello, I'd like to discuss some issues with you. Could I talk to you privately? Do you have an email or WeChat? Thank you~

CoderPat commented 5 years ago

Sure, it is in my GitHub profile

shellycsy commented 5 years ago

I have the same problem as you:

    eval loss: 7.43, eval rouge: 0.00
    early stopping triggered...

Were you able to solve this problem?