cvdfoundation / kinetics-dataset


Missing Videos in train, val, and test #51

Closed Ridhamz-nd closed 3 weeks ago

Ridhamz-nd commented 4 months ago

non_existing_test.txt non_existing_train.txt non_existing_val.txt

(base) ubuntu@ridham.exp.bf16.2:~/datasets/k400$ cat annotations/val.csv | cut -f 2 -d, | sort | uniq | wc
  19907   19907  238883
(base) ubuntu@ridham.exp.bf16.2:~/datasets/k400$ ls val_ext/ | grep mp4 | sort | uniq | wc
  19881   19881  596430
(base) ubuntu@ridham.exp.bf16.2:~/datasets/k400$ cat annotations/test.csv | cut -f 2 -d, | sort | uniq | wc
  39806   39806  477671
(base) ubuntu@ridham.exp.bf16.2:~/datasets/k400$ ls test_ext | grep mp4 | wc
  38685   38685 1160550
(base) ubuntu@ridham.exp.bf16.2:~/datasets/k400$ cat annotations/train.csv | cut -f 2 -d, | sort | uniq | wc
 246535  246535 2958419
(base) ubuntu@ridham.exp.bf16.2:~/datasets/k400$ ls train_ext/ | grep mp4 | sort | uniq | wc
 241258  241258 7237740

The counts above show the same gap in every split: each annotation CSV lists more unique YouTube IDs than there are extracted .mp4 files. I've attached a script that finds the missing videos, plus three files (one per split) listing the YouTube IDs whose videos don't exist.

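import sys
import os
import pandas as pd

# Usage: python <script> <annotations.csv> <video_dir>
df = pd.read_csv(sys.argv[1])
video_ids = df.youtube_id

# Collect the YouTube IDs that have at least one extracted clip.
extracted_youtube_ids = set()
for listed_item in os.listdir(sys.argv[2]):
    # Assumes clip names look like <youtube_id>_<start>_<end>.mp4; stripping the
    # last two fields keeps IDs that themselves contain '_00' intact.
    extracted_youtube_ids.add(listed_item.rsplit('_', 2)[0])

# Print every annotated ID with no extracted clip (one line per CSV row).
for video_id in video_ids:
    if video_id not in extracted_youtube_ids:
        print(video_id)

Ridhamz-nd commented 4 months ago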

Note that my command-line output shows a gap of 5277 YouTube IDs for train (246535 - 241258), whereas non_existing_train.txt contains 5286. I'm not sure what causes this difference, so any help is much appreciated.
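A few things could explain why the two numbers differ, and the sketch below is one way to check them; it is not a confirmed diagnosis. First, the `cut | sort | uniq | wc` count appears to include the CSV header line: the byte counts in the output are consistent with that (for train, 246534 x 12 + 11 = 2958419). Second, the script prints one line per CSV row, so any ID that appears in more than one row is printed more than once. Third, `ls | wc` counts files, not unique IDs. Fourth, `split('_00')` truncates any ID that itself contains `_00`. The helper names `clip_id` and `diagnose` are hypothetical, and the `<youtube_id>_<start>_<end>.mp4` layout is an assumption inferred from the file listing.

```python
from collections import Counter


def clip_id(filename):
    """Recover the YouTube ID, assuming '<id>_<start>_<end>.mp4' names."""
    return filename.rsplit("_", 2)[0]


def diagnose(csv_ids, filenames):
    """Compare CSV rows against extracted files and report the relevant counts."""
    mp4s = [f for f in filenames if f.endswith(".mp4")]
    on_disk = {clip_id(f) for f in mp4s}
    # What the original split('_00') would have produced for the same files.
    on_disk_hacky = {f.split("_00")[0] for f in mp4s}
    row_counts = Counter(csv_ids)
    missing_rows = [v for v in csv_ids if v not in on_disk]
    return {
        "csv_rows": len(csv_ids),
        "unique_csv_ids": len(row_counts),
        "duplicate_csv_rows": len(csv_ids) - len(row_counts),
        "mp4_files": len(mp4s),
        "unique_ids_on_disk": len(on_disk),
        "missing_rows": len(missing_rows),
        "missing_unique_ids": len(set(missing_rows)),
        # True IDs that the '_00' split would have mangled.
        "ids_truncated_by_00_split": len(on_disk - on_disk_hacky),
    }
```

To run it on a split, pass `pd.read_csv("annotations/train.csv").youtube_id.tolist()` and `os.listdir("train_ext")`. If `missing_rows` is larger than `missing_unique_ids`, duplicate CSV rows account for at least part of the gap.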

daviduarte commented 3 weeks ago

I'm also missing files here (Kinetics 400): 8304 videos in total across train, val, and test. Your three files add up to 6432 videos.

daviduarte commented 3 weeks ago

Duplicate issue. Just update the .csv files to leave out the missing videos.
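One way to do that update is sketched below, using only the standard library. It assumes the `youtube_id` column used by the script earlier in this thread and clip names of the form `<youtube_id>_<start>_<end>.mp4`; the function names `existing_ids` and `filter_annotations` are hypothetical. It writes a new file rather than overwriting the original annotations.

```python
import csv
import os


def existing_ids(video_dir):
    """Return the YouTube IDs that have at least one .mp4 clip in video_dir."""
    return {
        name.rsplit("_", 2)[0]
        for name in os.listdir(video_dir)
        if name.endswith(".mp4")
    }


def filter_annotations(src_csv, dst_csv, video_dir, id_column="youtube_id"):
    """Copy src_csv to dst_csv, keeping only rows whose video exists on disk.

    Returns a (kept, dropped) tuple of row counts.
    """
    keep = existing_ids(video_dir)
    kept = dropped = 0
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[id_column] in keep:
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

For example, `filter_annotations("annotations/val.csv", "annotations/val_filtered.csv", "val_ext")` would leave the original CSV untouched and report how many rows were kept and dropped.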