abisee / pointer-generator

Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"

Provided test set output is different from the output generated using the pre-trained models. #63

Open LeenaShekhar opened 6 years ago

LeenaShekhar commented 6 years ago
1. I used the pre-trained models (TF 1.2.1) to decode the test set and calculated ROUGE scores on the output. These outputs look different from the ones provided under the test_output folder.

For example:

REFERENCE SUMMARY: in an interview with the new york times , president obama says he understands israel feels particularly vulnerable . obama calls the nuclear deal with iran a `` once-in-a-lifetime opportunity '' israeli prime minister benjamin netanyahu and many u.s. republicans warn that iran can not be trusted .

Generated summary using the pre-trained (pre-coverage) model:

president barack obama says he is absolutely committed to making sure '' israel maintains a military advantage over iran . obama says he is absolutely committed to making sure they maintain their qualitative military edge ''

Generated summary under test_output/pre-coverage folder:

president barack obama says he is absolutely committed to making sure '' israel maintains a military advantage over iran . obama said he understands and respects netanyahu 's stance that israel is particularly vulnerable and does n't have the luxury of testing these propositions '' .

2. Because of the difference in the generated summaries, the ROUGE scores look different as well (much lower than the reported values).

Reported ROUGE scores (calculated on the given output under the test_output/pre-coverage folder):

ROUGE-1:
rouge_1_f_score: 0.3644 with confidence interval (0.3619, 0.3666)
rouge_1_recall: 0.3760 with confidence interval (0.3732, 0.3787)
rouge_1_precision: 0.3776 with confidence interval (0.3749, 0.3804)

ROUGE-2:
rouge_2_f_score: 0.1566 with confidence interval (0.1543, 0.1589)
rouge_2_recall: 0.1612 with confidence interval (0.1587, 0.1636)
rouge_2_precision: 0.1631 with confidence interval (0.1607, 0.1655)

ROUGE-L:
rouge_l_f_score: 0.3342 with confidence interval (0.3319, 0.3366)
rouge_l_recall: 0.3446 with confidence interval (0.3418, 0.3473)
rouge_l_precision: 0.3466 with confidence interval (0.3439, 0.3493)

ROUGE scores on the output from the pre-trained model:

ROUGE-1:
rouge_1_f_score: 0.3577 with confidence interval (0.3555, 0.3601)
rouge_1_recall: 0.3716 with confidence interval (0.3690, 0.3744)
rouge_1_precision: 0.3689 with confidence interval (0.3662, 0.3717)

ROUGE-2:
rouge_2_f_score: 0.1528 with confidence interval (0.1508, 0.1551)
rouge_2_recall: 0.1584 with confidence interval (0.1562, 0.1607)
rouge_2_precision: 0.1585 with confidence interval (0.1562, 0.1609)

ROUGE-L:
rouge_l_f_score: 0.3254 with confidence interval (0.3232, 0.3279)
rouge_l_recall: 0.3378 with confidence interval (0.3354, 0.3405)
rouge_l_precision: 0.3359 with confidence interval (0.3333, 0.3386)

I ran the model out of the box without any changes. Is there anything I am missing?

LeenaShekhar commented 6 years ago

@wchowdhu In my case the generated summaries were different, but not as bad as yours. I have a feeling you are not loading the checkpoint; check your checkpoint path and make sure the graph is initialized with those values rather than with random ones.
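One way to sanity-check this is the minimal sketch below (assuming TF 1.x; the log-directory path is just a placeholder, not the repo's actual layout): it prints which checkpoint would be restored, lists what it contains, and restores it explicitly so a missing variable fails loudly instead of silently keeping its random initialization.

```python
import tensorflow as tf  # TF 1.x

# Resolve the latest checkpoint; if this prints None, nothing is being restored.
ckpt_path = tf.train.latest_checkpoint("log/myexperiment/train")  # placeholder path
print("Restoring from:", ckpt_path)

# Inspect what the checkpoint actually contains before restoring.
reader = tf.train.NewCheckpointReader(ckpt_path)
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape)

# After the model graph has been built, restore explicitly; saver.restore raises
# an error if variables are missing instead of falling back to random values.
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, ckpt_path)
```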

huyingxi commented 6 years ago

I have the same question. Using the pre-trained model provided on the website to make predictions, the results I obtain are quite different from the results in the paper. I don't know whether my procedure is incorrect or whether there is some other reason.

LeenaShekhar commented 6 years ago

Hi @huyingxi, I am not sure how different "quite different" is in your case. I am not working on this currently, but as you can see from my post above, the results differed by about 1.5% when I simply ran the pre-trained model without introducing any changes on my end.

I have a feeling that since the code base was rewritten for the TF version switch, the model might have been affected. This was with the model from 5-6 months ago, though.

All the best.

hanghang2333 commented 6 years ago

My guess is that a model pre-trained with the Python 2 version of TensorFlow may give different results when loaded and run under Python 3, but I'm not sure. I hope someone can verify this. (In fact, I have encountered this situation before.)

LeenaShekhar commented 6 years ago

Sorry, I did not mean a Python 2 vs. Python 3 difference. I think the author ported the code from an older TF version to a newer one, and that might have introduced something. Again, this is my guess, based on a few email exchanges with the author a while back about a different issue.

hanghang2333 commented 6 years ago

@LeenaShekhar Hello, I have another problem. When computing ROUGE scores, the ground truth has several sentences, i.e. multiple "@highlight" sections in the original data file. Looking at the code, training combines the multiple ground-truth sentences into a single target. So when computing ROUGE scores, should I put the reference sentences into multiple files, like [some_name.A.001.txt, some_name.B.001.txt, some_name.C.001.txt], or combine them into a single file? When I combine them into a single file, the ROUGE-1 and ROUGE-2 scores are still normal, but the ROUGE-L score is very low.
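For concreteness, below is a minimal sketch of one common pyrouge layout (file names and patterns are illustrative, not necessarily identical to this repo's decode.py): a single reference file per example, with each ground-truth sentence on its own line. ROUGE-1.5.5 computes ROUGE-L sentence by sentence, so collapsing all reference sentences onto one line tends to depress ROUGE-L while leaving ROUGE-1/2 largely unchanged.

```python
import os
import pyrouge

def write_example(idx, reference_sents, decoded_sents, ref_dir, dec_dir):
    # One file per example; one sentence per line so ROUGE-1.5.5 can do its
    # sentence-level ROUGE-L computation.
    with open(os.path.join(ref_dir, "%06d_reference.txt" % idx), "w") as f:
        f.write("\n".join(reference_sents))
    with open(os.path.join(dec_dir, "%06d_decoded.txt" % idx), "w") as f:
        f.write("\n".join(decoded_sents))

r = pyrouge.Rouge155()
r.model_filename_pattern = "#ID#_reference.txt"    # reference (ground-truth) summaries
r.system_filename_pattern = r"(\d+)_decoded.txt"   # model output summaries
r.model_dir = "path/to/reference"                  # placeholder directories
r.system_dir = "path/to/decoded"
print(r.convert_and_evaluate())
```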