furkanbiten / GoodNews

Good News Everyone! - CVPR 2019

After I run clean_captions.py the news_dataset.json file is not created #18

Closed monajalal closed 4 years ago

monajalal commented 4 years ago

I don't get any error, and I see the captions printed. However, only these files are created (val.json is also missing):

Vice President Joseph R. Biden Jr. in March at a Jordanian-American military training center in Zarqa, Jordan.
The funeral in Rimoun, Jordan, for Anwar Abu Zaid, a police captain who was killed after he attacked a police training center in November. American and Jordanian officials said they believed that the weapons he used had been meant for a program to train Syrian rebels.
Ambulances leaving the police training center where Captain Abu Zaid gunned down five people, including two American contractors.
Jeremy Beck and Rebecca Noelle Brinkley in the Mint Theater production of “Hindle Wakes.”
Priscilla Chan and Mark Zuckerberg say their charity plan is to "advance human potential and promote equality in areas such as health, education, scientific research and energy."
The official Rio 2016

mona@mona:~/research/GoodNews/prepocess$ ls ../data
total 1021M
-rw-rw-r--  1 mona mona   71 Sep  7 12:28 .gitignore
-rw-rw-r--  1 mona mona 985M Sep  7 12:29 captioning_dataset.json
drwxrwxr-x 12 mona mona 4.0K Sep  7 12:47 ..
-rw-rw-r--  1 mona mona  36M Sep  7 21:21 article_urls.json
drwxrwxr-x  2 mona mona 4.0K Sep  7 21:22 .
-rw-rw-r--  1 mona mona    0 Sep  9 10:23 test.json

Do you know what the reason is and how to fix it? I have also tried absolute paths, but no luck:

    with open("/home/mona/research/GoodNews/data/captioning_dataset.json", "rb") as f:
        captioning_dataset = json.load(f)

    for k, anns in tqdm.tqdm(captioning_dataset.items()):

        for ix, img in anns['images'].items():
            try:
                split = get_split()

                #         import ipdb; ipdb.set_trace()
                img = preprocess_sentence(img)
                template, full = NER(' '.join(img))
                if len(' '.join(template)) != 0:
                    news_data.append({'filename': k + '_' + ix + '.jpg', 'filepath': 'resized', 'cocoid': counter,
                                      'imgid': k + '_' + ix, 'sentences': [], 'sentences_full': [],
                                      #                               'sentences_article':[],
                                      'split': split})
                    news_data[counter]['sentences'].append(
                        {'imgid': counter, 'raw': ' '.join(template), 'tokens': template})
                    news_data[counter]['sentences_full'].append(
                        {'imgid': counter, 'raw': ' '.join(full), 'tokens': full})
                    counter += 1
            except:
                print(img)
    split_to_ix = {i:n['split'] for i, n in enumerate(news_data)}
    # train = [news_data[k] for k, v in split_to_ix.items() if v =='train']
    val = [news_data[k] for k, v in split_to_ix.items() if v =='val']
    test = [news_data[k] for k, v in split_to_ix.items() if v =='test']
    with open("/home/mona/research/GoodNews/data/test.json", "wb") as f:
        json.dump(test, f)
    with open("/home/mona/research/GoodNews/data/val.json", "wb") as f:
        json.dump(val, f)
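One thing worth noting in the snippet above: the bare `except:` in the loop swallows every error (including whatever is actually failing) and only prints the image. A small debugging sketch, not the repo's code, showing how to collect the real exception instead (`safe_process` and its arguments are hypothetical names for illustration):

```python
# Sketch: log failures instead of silently swallowing them (hypothetical helper).
def safe_process(items, process):
    """Apply `process` to each (key, value) pair, collecting errors per key."""
    results, failures = [], []
    for key, value in items:
        try:
            results.append(process(value))
        except Exception as e:  # broad on purpose: we want to see every error
            failures.append((key, repr(e)))
    return results, failures

results, failures = safe_process([("a", "1"), ("b", "x"), ("c", "3")], int)
print(results)         # [1, 3]
print(failures[0][0])  # b -- and repr(e) says what actually broke
```

Applying the same pattern inside the `for ix, img in anns['images'].items():` loop (printing `k`, `ix`, and `repr(e)`) would show whether `NER`, `preprocess_sentence`, or the file I/O is the part that fails.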
monajalal commented 4 years ago

I changed the 'wb' and 'rb' to 'w' and 'r', and also printed the values of 'val' and 'test'. Now news_dataset.json is dumped, but everything is [] (empty lists). Not sure why!
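The mode change matters because of a common Python 3 pitfall: `json.dump` writes `str`, so a file opened in binary mode (`"wb"`) raises `TypeError`, and if that happens inside a bare `except:` it never surfaces. A minimal sketch reproducing the mode issue (the temp path is illustrative):

```python
import json
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo_modes.json")  # illustrative path

# In Python 3, json.dump emits str, so a binary-mode file object fails.
binary_mode_failed = False
try:
    with open(path, "wb") as f:
        json.dump({"ok": True}, f)
except TypeError:
    binary_mode_failed = True
print("binary-mode dump raised TypeError:", binary_mode_failed)

# Text mode is what json.dump expects.
with open(path, "w") as f:
    json.dump({"ok": True}, f)
with open(path) as f:
    print(json.load(f))  # {'ok': True}
```

This explains why switching to `'w'`/`'r'` lets the files get written; the remaining empty-list symptom suggests the items are still being dropped earlier in the loop (or all assigned to the 'train' split), which the swallowed exceptions would hide.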

monajalal commented 4 years ago

Follow-up issue: https://github.com/furkanbiten/GoodNews/issues/19