alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

The number of articles of NYTimes800K #36

Open tjuwyh opened 2 years ago

tjuwyh commented 2 years ago

Hi, Alasdair. I found that the number of articles in "nytimes-2020-04-21.gz" does not match the numbers reported in your paper. In Table 2 of "Transform and Tell", the training, validation, and test splits contain 433561, 2978, and 8375 articles, but the MongoDB backup file you provided contains 434314, 3052, and 8495 articles. Did I do something wrong? If not, why are there more articles in the NYTimes800k backup than in the paper?

alasdairtran commented 2 years ago

Some articles don't have an image. The numbers in Table 2 of our paper count only the articles with at least one image and one caption.

If you'd like to reproduce the numbers in Table 2, you can run the function compute_nytimes_stats in this script. As you can see, it excludes articles that lack an image or a caption from the count.
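
For readers who want to see the idea without opening the script, here is a minimal sketch of that kind of filter. It is not the code of compute_nytimes_stats: the function name `count_captioned_articles` and the field names `split`, `images`, and `caption` are assumptions made for illustration, and the real MongoDB documents may be structured differently.

```python
from collections import Counter


def count_captioned_articles(articles):
    """Count, per split, the articles with at least one captioned image.

    Assumes each article is a dict with a hypothetical "split" field and a
    hypothetical "images" list whose items carry a "caption" string.
    """
    counts = Counter()
    for article in articles:
        images = article.get("images", [])
        # Keep the article only if some image has a non-empty caption.
        if any(image.get("caption") for image in images):
            counts[article["split"]] += 1
    return counts


# Toy data standing in for documents read from the database.
sample = [
    {"split": "train", "images": [{"caption": "A street scene."}]},
    {"split": "train", "images": []},                 # no image: skipped
    {"split": "valid", "images": [{"caption": ""}]},  # empty caption: skipped
    {"split": "test", "images": [{"caption": "A skyline."}]},
]

counts = count_captioned_articles(sample)
print(dict(counts))
```

Going by the figures quoted in this thread, a filter of this kind would account for a gap of 753 training, 74 validation, and 120 test articles between the backup and Table 2.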