alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

dataset #7

Open xuewyang opened 4 years ago

xuewyang commented 4 years ago

Hi Alasdair,

I observed the same issue that you mentioned in the paper about the GoodNews dataset: "Many of the articles in GoodNews are partially extracted because the generic article extraction library failed to recognize some of the HTML tags specific to The New York Times."

  1. Have you tried re-crawling these articles via their links?
  2. Is there a similar issue with NYTimes800k? Thanks.
alasdairtran commented 4 years ago

Since about 94% of the captions in GoodNews are also in NYTimes800k (the other 6% now point to dead links I think), you can (almost) reconstruct a cleaner version of GoodNews by taking a subset of NYTimes800k.

We don't have the same issue with NYTimes800k (I wrote a custom parser that takes care of the corner cases). To see this, you can select all articles in NYTimes800k that also appear in GoodNews: the average article length in the NYTimes800k subset is 960, whereas it's only 450 in GoodNews.
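A minimal sketch of that subset selection with pymongo (assuming the two MongoDB databases are named goodnews and nytimes, and that articles in both store a web_url field to match on; adjust the names to your setup):

    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)

    # Collect the URLs of all GoodNews articles.
    goodnews_urls = [
        a['web_url']
        for a in client.goodnews.articles.find({}, projection=['web_url'])
    ]

    # Select the NYTimes800k articles that also appear in GoodNews.
    # (An $in query over a large URL list is slow, but fine as a one-off.)
    overlap = client.nytimes.articles.find({'web_url': {'$in': goodnews_urls}})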

In the paper, we didn't fix the GoodNews dataset because we wanted to compare against the numbers reported in the GoodNews paper.

xuewyang commented 4 years ago

Gotcha, that is good to know.

xuewyang commented 4 years ago

Hi, I am wondering why the MongoDB dataset is over 40 GB. As far as I can tell from goodnews_flattened.py, we only have articles, image IDs, and captions. We can retrieve them using the following code. I stored them in JSON, and it only takes about 3 GB. Can you explain this? Thanks.

    import os
    from PIL import Image

    # From the dataset reader's generator: self.db is a pymongo database
    # and ids holds the sample ids for the split.
    for sample_id in ids:
        sample = self.db.splits.find_one({'_id': {'$eq': sample_id}})

        # Find the corresponding article
        article = self.db.articles.find_one({
            '_id': {'$eq': sample['article_id']},
        }, projection=['_id', 'context', 'images', 'web_url'])

        # Load the image
        image_path = os.path.join(self.image_dir, f"{sample['_id']}.jpg")
        try:
            image = Image.open(image_path)
        except (FileNotFoundError, OSError):
            continue

        yield self.article_to_instance(article, image, sample['image_index'], image_path)
alasdairtran commented 4 years ago

The database contains the pretrained face embeddings and object embeddings, which are used in the full model. All of the captions and article texts also carry POS and NER annotations from spaCy.
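For what it's worth, a rough way to see where the space goes is to measure the BSON size of each top-level field in a single article (a sketch, assuming a local MongoDB instance with the nytimes database; the exact field names depend on where the embeddings and annotations are stored):

    import bson
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    article = client.nytimes.articles.find_one()

    # Encode each top-level field on its own to estimate its share of the
    # document size; embedding arrays and token-level annotations usually
    # dwarf the raw text.
    for key, value in article.items():
        size = len(bson.BSON.encode({key: value}))
        print(f'{key}: {size / 1024:.1f} KiB')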

xuewyang commented 4 years ago

Hi, for the NYTimes800k dataset, is location_aware applied in nytimes_faces_ner_matched.py? Is that where the 512 tokens around the image are extracted? If I want to extract 1000 tokens, do I just change the 512 to 1000?

Thank you.

alasdairtran commented 4 years ago

Yes, location_aware is implemented in nytimes_faces_ner_matched.py. You can see that the code tries to extract the text above the image into a list called before, and the text below the image into a list called after.

Yes, change it to 1000 if you want 1000 tokens. But note that there's another hard cutoff in the token indexer here, because the BERT/RoBERTa encoders only support sequences of at most 512 tokens.
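The windowing itself is simple. A minimal sketch of the idea (a hypothetical helper, not the repo's exact code, which also handles tokenization, paragraph boundaries, and special tokens):

    def context_around_image(tokens, image_index, n_tokens=512):
        """Return up to n_tokens tokens centred on the image position."""
        half = n_tokens // 2
        before = tokens[max(0, image_index - half):image_index]
        after = tokens[image_index:image_index + half]
        return before + after

Even with n_tokens=1000 here, anything past the encoder's 512-token limit would still be dropped by that second cutoff.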

xuewyang commented 4 years ago

Yes, thank you.