facebookresearch / Ego4d

Ego4d dataset repository. Download the dataset, visualize, extract features & example usage of the dataset
https://ego4d-data.org/docs/
MIT License

EM-NLQ data - Missing samples #64

Closed lbaermann closed 2 years ago

lbaermann commented 2 years ago

Dear Ego4D-Team,

Thanks again for the dataset! I am currently working with the NLQ data, and the number of samples does not seem to match the numbers reported in the paper.

>>> import json
>>> x = json.load(open('nlq_train.json'))
>>> len([c for v in x['videos'] for c in v['clips']])
998
>>> len([q for v in x['videos'] for c in v['clips'] for a in c['annotations'] for q in a['language_queries']])
11296

When doing the same with nlq_val.json, I get the following numbers:

            train    val
#clips      998      328
#queries    11296    3875
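
The two counts can be gathered for any split with one helper. This is a minimal sketch, assuming the nesting used in the snippet above (`videos` → `clips` → `annotations` → `language_queries`); the names `count_nlq` and `count_nlq_file` are illustrative and not part of the Ego4D tooling.

```python
import json


def count_nlq(data):
    """Return (#clips, #queries) for one loaded NLQ annotation dict.

    Assumes the layout seen in the issue:
    data['videos'][i]['clips'][j]['annotations'][k]['language_queries'].
    """
    clips = [c for v in data["videos"] for c in v["clips"]]
    queries = [
        q
        for c in clips
        for a in c["annotations"]
        for q in a["language_queries"]
    ]
    return len(clips), len(queries)


def count_nlq_file(path):
    # Hypothetical convenience wrapper: loads the JSON and closes the file.
    with open(path) as f:
        return count_nlq(json.load(f))
```

Calling `count_nlq_file('nlq_train.json')` and `count_nlq_file('nlq_val.json')` should correspond to the train and val columns of the table above, provided the files have the same layout.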

In contrast, in the paper you report (Table 6 in the appendix, p. 27):

            train    val
#clips      2.1k     0.7k
#queries    14.7k    5.1k

Am I missing something?

Furthermore, I noticed that a small number of NLQ annotations have no query:

>>> [a['annotation_uid'] + '_' + str(i) for v in x['videos'] for c in v['clips'] for a in c['annotations'] for i, q in enumerate(a['language_queries']) if 'query' not in q or not q['query']]
['7b63ccdc-9246-448b-9068-d6e05d6cbe5f_4', '51b44048-723b-4101-b1f1-0c73c35800b1_2', '390ded20-fb92-482a-8c65-89b517a69706_5', '43381ede-b3d9-4d66-8d7e-7ae2f9cf1f0d_5', '2a2a09d6-fe45-4176-9350-c8870aada15d_3']

in train and 6d490d22-8ba3-4f13-82ad-ab1ec9074cfe_8 in val
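
For anyone who wants to skip these entries when loading the data, the check above can be wrapped as follows. This is only a sketch under the same layout assumptions; `empty_query_ids` and `drop_empty_queries` are hypothetical helper names, and `not q.get('query')` is used as shorthand for the condition `'query' not in q or not q['query']`.

```python
def empty_query_ids(data):
    """Return '<annotation_uid>_<index>' for every missing or empty query."""
    return [
        f"{a['annotation_uid']}_{i}"
        for v in data["videos"]
        for c in v["clips"]
        for a in c["annotations"]
        for i, q in enumerate(a["language_queries"])
        if not q.get("query")  # key absent, None, or empty string
    ]


def drop_empty_queries(data):
    """Remove missing/empty queries in place; return how many were removed."""
    removed = 0
    for v in data["videos"]:
        for c in v["clips"]:
            for a in c["annotations"]:
                kept = [q for q in a["language_queries"] if q.get("query")]
                removed += len(a["language_queries"]) - len(kept)
                a["language_queries"] = kept
    return removed
```

Note that dropping entries shifts the positions of the remaining queries within an annotation, so the `_<index>` suffixes listed above refer to the unfiltered files.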

Thanks for your help in advance!

ebyrne commented 2 years ago

We appreciate the detailed question, Leonard. And you're absolutely right: the current arXiv paper reflects an earlier version of the dataset. An arXiv update is coming (I believe tonight, actually!) that should line up with the numbers you're seeing, which look accurate to me.

(Some of the data was withheld from the earliest version of the dataset, and a much larger set of annotations will be available later this year in a dataset update.)

Please do let us know once that's out if you notice any other gaps!