cvdfoundation / kinetics-dataset

Many videos in the Kinetics700-2020 are shorter than 10 seconds #4

Closed hukkai closed 2 years ago

hukkai commented 3 years ago

Hi, many videos in Kinetics700-2020 are shorter than 10 seconds, although every clip is supposed to be 10 seconds long. In the test split, over 25% of the videos are affected. Here are some examples that are shorter than 8 seconds.

Kinetics700-2020-test/v55ikd_-Rc4_000141_000151.mp4 Kinetics700-2020-test/52mb2tRzayU_000106_000116.mp4 Kinetics700-2020-test/9k3bdcoMTVY_000013_000023.mp4 Kinetics700-2020-test/f9FftpAwmws_000074_000084.mp4 Kinetics700-2020-test/714LsaiTVVk_000002_000012.mp4 Kinetics700-2020-test/7WtqdnyTXjY_000004_000014.mp4 Kinetics700-2020-test/bbaRarfa-X0_000073_000083.mp4 Kinetics700-2020-test/xKnk1UYdgac_000000_000010.mp4 Kinetics700-2020-test/Pf5jowvNpiE_000013_000023.mp4 Kinetics700-2020-test/A1CQslN-Xbw_000010_000020.mp4 Kinetics700-2020-test/aJw7fScmOGo_000007_000017.mp4 Kinetics700-2020-test/2bI8oYlrWjs_000000_000010.mp4 Kinetics700-2020-test/KV8RVTRTAL0_000007_000017.mp4 Kinetics700-2020-test/rAgdt5mqCwA_000048_000058.mp4 Kinetics700-2020-test/LhW0hADHePo_000000_000010.mp4 Kinetics700-2020-test/Fo7EYCBwDaw_000135_000145.mp4 Kinetics700-2020-test/72PEZjijk8o_000002_000012.mp4 Kinetics700-2020-test/3-3e71B5yBo_000000_000010.mp4 Kinetics700-2020-test/d61S7amsWsM_000003_000013.mp4 Kinetics700-2020-test/191VnlH8z68_000002_000012.mp4 Kinetics700-2020-test/QV6D9MoUlH4_000042_000052.mp4 Kinetics700-2020-test/flXQJFDjw1E_000001_000011.mp4 Kinetics700-2020-test/iCgHfcLhnDU_000318_000328.mp4 Kinetics700-2020-test/6vemGexYgHI_000003_000013.mp4 Kinetics700-2020-test/2AxfjxBvh10_000000_000010.mp4 Kinetics700-2020-test/4LFQuxKfFIQ_000261_000271.mp4 Kinetics700-2020-test/4QYmCBN1nHQ_000046_000056.mp4 Kinetics700-2020-test/cPd1GhGV4Fg_000011_000021.mp4 Kinetics700-2020-test/4V7JPYZBnCM_000014_000024.mp4 Kinetics700-2020-test/3xcQj9HZP5Y_000000_000010.mp4 Kinetics700-2020-test/1LaRLvgZTjI_000114_000124.mp4 Kinetics700-2020-test/8uGAZkuoXVg_000078_000088.mp4 Kinetics700-2020-test/42vZ8I-jRPg_000034_000044.mp4 Kinetics700-2020-test/1f-5jxwtibg_000262_000272.mp4 Kinetics700-2020-test/6_T1NJTMNuc_000000_000010.mp4 Kinetics700-2020-test/1F4REb4pqo0_000001_000011.mp4 Kinetics700-2020-test/3OPqFdZlaNY_000075_000085.mp4 Kinetics700-2020-test/JE8h-yGd25w_000000_000010.mp4 
Kinetics700-2020-test/9PVi6qiS7zM_000006_000016.mp4 Kinetics700-2020-test/0hMk37By7t4_000021_000031.mp4 Kinetics700-2020-test/Pd_gOf0TY7M_000050_000060.mp4 Kinetics700-2020-test/KdD5HVxwaQE_000018_000028.mp4 Kinetics700-2020-test/caBITzNkOis_000014_000024.mp4 Kinetics700-2020-test/3lGPnnsf9Y8_000004_000014.mp4 Kinetics700-2020-test/1OvQ9_ZgnIA_000000_000010.mp4 Kinetics700-2020-test/AkIhOrNcbUA_000020_000030.mp4 Kinetics700-2020-test/M45S-HkcwTM_000049_000059.mp4 Kinetics700-2020-test/FOa1tk1Isi0_000038_000048.mp4 Kinetics700-2020-test/OgXl2BKdUoU_000012_000022.mp4 Kinetics700-2020-test/uaKPPePpSY0_000006_000016.mp4 Kinetics700-2020-test/-_D7UCii3FU_000021_000031.mp4 Kinetics700-2020-test/3Hr-2TpgVEE_000057_000067.mp4 Kinetics700-2020-test/1Je9mL8Uudo_000000_000010.mp4 Kinetics700-2020-test/N1IGDSJoia0_000000_000010.mp4 Kinetics700-2020-test/9EiQCNi4bOA_000023_000033.mp4 Kinetics700-2020-test/0C9EO_A2PIY_000004_000014.mp4 Kinetics700-2020-test/B0n-nS4Y6xs_000000_000010.mp4 Kinetics700-2020-test/45E3EdNaoHg_000013_000023.mp4 Kinetics700-2020-test/6hpPVBBGZ74_000009_000019.mp4 Kinetics700-2020-test/a1jyH4CJJR4_000000_000010.mp4 Kinetics700-2020-test/AzQ6mn_6ZKc_000000_000010.mp4 Kinetics700-2020-test/0zr5-JyS0Xc_000047_000057.mp4 Kinetics700-2020-test/43D0gnE5Z7o_000083_000093.mp4 Kinetics700-2020-test/IVW_Yk2lyDg_000000_000010.mp4 Kinetics700-2020-test/2R45XkkgbAQ_000045_000055.mp4 Kinetics700-2020-test/8N6-DeT6mXs_000048_000058.mp4 Kinetics700-2020-test/6ATIhv4DFjo_000034_000044.mp4 Kinetics700-2020-test/3E9AdPkiz9o_000000_000010.mp4 Kinetics700-2020-test/5XgnD4P9B-M_000005_000015.mp4 Kinetics700-2020-test/AB305H8Np48_000040_000050.mp4 Kinetics700-2020-test/3OezYSbd_n4_000064_000074.mp4 Kinetics700-2020-test/Z8e-EfVlIx0_000000_000010.mp4 Kinetics700-2020-test/6m_8FNc2scg_000137_000147.mp4 Kinetics700-2020-test/K43n8RqxbFQ_000101_000111.mp4 Kinetics700-2020-test/kgIEx-OjPG0_000000_000010.mp4 Kinetics700-2020-test/0nLH52UNKhw_000000_000010.mp4 
Kinetics700-2020-test/5V7GTuihlQQ_000002_000012.mp4 Kinetics700-2020-test/1hZV-H5yl6s_000000_000010.mp4 Kinetics700-2020-test/COZqe2f1Axg_000031_000041.mp4 Kinetics700-2020-test/29GNPtZaqS4_000001_000011.mp4 Kinetics700-2020-test/83J0uf8cJlI_000025_000035.mp4 Kinetics700-2020-test/6Zl5jX9fjKE_000139_000149.mp4 Kinetics700-2020-test/1AGYst8AKCc_000000_000010.mp4 Kinetics700-2020-test/cEmdLm8cBNE_000037_000047.mp4 Kinetics700-2020-test/1FUiMeIu7sE_000011_000021.mp4 Kinetics700-2020-test/5JBC5X0O73k_000005_000015.mp4 Kinetics700-2020-test/Cpn-XAerL5I_000011_000021.mp4 Kinetics700-2020-test/aFqlkvgQKho_000000_000010.mp4 Kinetics700-2020-test/aUOo5M67Itc_000010_000020.mp4 Kinetics700-2020-test/BJaHpp_K148_000190_000200.mp4 Kinetics700-2020-test/AivUke09tz8_000019_000029.mp4 Kinetics700-2020-test/f5WiwscpVlE_000000_000010.mp4 Kinetics700-2020-test/44esMhYjLRs_000019_000029.mp4 Kinetics700-2020-test/p-koaErOtiI_000075_000085.mp4 Kinetics700-2020-test/aC6__nAesz8_000103_000113.mp4 Kinetics700-2020-test/CrnGGdO3C4M_000030_000040.mp4 Kinetics700-2020-test/8YbNZ3lm7Ts_000000_000010.mp4 Kinetics700-2020-test/CAUQyTTat2M_000011_000021.mp4 Kinetics700-2020-test/CZbXx9UW2FE_000146_000156.mp4 Kinetics700-2020-test/7JYYa4C5u4A_000003_000013.mp4
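
A minimal sketch of how such a scan could be run, assuming `ffprobe` (from FFmpeg) is installed and on `PATH`. The function names, the 10-second default threshold, and the choice to treat unreadable files as zero-length are illustrative assumptions, not the method used in this thread.

```python
import subprocess
from pathlib import Path


def probe_duration(path):
    """Return the container duration in seconds as reported by ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return float(out)


def short_fraction(durations, threshold=10.0):
    """Return (sorted names shorter than threshold, fraction of all clips)."""
    short = sorted(name for name, d in durations.items() if d < threshold)
    frac = len(short) / len(durations) if durations else 0.0
    return short, frac


def scan(root, threshold=10.0, probe=probe_duration):
    """Probe every .mp4 directly under root and summarise the short ones."""
    durations = {}
    for p in Path(root).glob("*.mp4"):
        try:
            durations[p.name] = probe(p)
        except (subprocess.CalledProcessError, ValueError):
            # Assumption: a file ffprobe cannot read counts as broken.
            durations[p.name] = 0.0
    return short_fraction(durations, threshold)
```

Calling `scan("Kinetics700-2020-test", threshold=8.0)` would return the clip names below 8 seconds together with their share of the split.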

TheShadow29 commented 2 years ago

@hukkai were you able to figure out why this is the case?

hukkai commented 2 years ago

@TheShadow29 They are download failures and need to be re-downloaded.

ShoufaChen commented 2 years ago

Hi, @hukkai

I noticed that these files are in .tar.gz format.

If a download fails, can the archive still be extracted successfully?

hukkai commented 2 years ago

@ShoufaChen, if a download fails, I delete the file and re-download it.

ShoufaChen commented 2 years ago

@hukkai Thanks for your reply. However, I was wondering whether the "shorter period than 10s" problem you mentioned above is caused by the download failure.

In my experience, if a download fails, the extraction crashes instead of producing videos shorter than 10 s.
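
One way to test this point is to read a downloaded `.tar.gz` end to end and see whether it raises. The sketch below uses only Python's standard `tarfile` module; the function name `archive_is_intact` is hypothetical. A truncated download should make it return `False`, whereas an archive that was transferred completely but contains short clips would pass.

```python
import tarfile
import zlib


def archive_is_intact(path, chunk_size=1 << 20):
    """Return True if every regular file in the .tar.gz can be read fully."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                f = tar.extractfile(member)
                remaining = member.size
                while remaining > 0:
                    chunk = f.read(min(chunk_size, remaining))
                    if not chunk:
                        return False  # data ended before the member did
                    remaining -= len(chunk)
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # Truncated or corrupt gzip/tar streams surface as one of these.
        return False
    return True
```

This only checks the container. It says nothing about whether each video inside has the expected length.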

hukkai commented 2 years ago

@ShoufaChen I downloaded k700_test_001.tar.gz and tested some videos in this tar file. I believe the download succeeded; the md5sum of k700_test_001.tar.gz is 1384c0c1ff1fe0463e17f52fdd79236e. I checked two mp4 files, with the following md5sums:

5a6bcdf0769d06c52400af4b7a0b7d4e  ./--HHKLZPagg_000029_000039.mp4
d51476957096504f46f1ca1545c87cea  ./-IN5x2nji3I_000003_000013.mp4

Both videos are shorter than 5 seconds. I checked the original videos, and the clips should be 10 seconds long.
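
For anyone who wants to compare against these values without the `md5sum` command, here is a small sketch using Python's standard `hashlib`. The helper names are hypothetical, and the checksum in the usage note is the one reported in this comment, not an official value published by the dataset maintainers.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected_hex):
    """Compare a file's MD5 with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches("k700_test_001.tar.gz", "1384c0c1ff1fe0463e17f52fdd79236e")` would tell you whether your copy is byte-identical to the one described here.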

ShoufaChen commented 2 years ago

Hi, @hukkai

When you said re-download, did you mean that you downloaded from the original YouTube link?

hukkai commented 2 years ago

@ShoufaChen I removed my local file downloaded from this repo's link and downloaded it again, watching the progress bar to make sure no download failure happened:

wget https://s3.amazonaws.com/kinetics/700_2020/test/k700_test_001.tar.gz

ShoufaChen commented 2 years ago

I see. Thank you very much for your reply.

hukkai commented 2 years ago

@ShoufaChen Could you download the tar file and compare its md5sum with my result? I do not think wget would fail this often without reporting any error.

Edit: I downloaded the file from three servers in west-US, east-US, and China, and got the same md5sum each time. So I think it is more likely that the original download done for this repo failed and many broken videos were saved into the archives.

ShoufaChen commented 2 years ago

So how did you solve this problem?

As I asked above, when you obtained the normal mp4 files (about 10 s), did you re-download them from the original YouTube link?

ShoufaChen commented 2 years ago

@TheShadow29 They are download failures and need to be re-downloaded.

  1. How did you re-download? Using the link provided by this repo, or from the original YouTube videos?
  2. Was this issue solved by the re-download method you describe in 1?

hukkai commented 2 years ago

@ShoufaChen

  1. I first check each video's length. If it is shorter than 10 seconds or has fewer than 300 frames, I re-download the video from YouTube.
  2. Yes. Fixing this issue gave us a 2% accuracy improvement on the test set (test accuracy obtained from the ActivityNet challenge server).
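
The rule in step 1 can be sketched as below. This is an illustration, not the script actually used. It assumes clip files are named `<youtube_id>_<start>_<end>.mp4` with start and end in seconds, as the file names earlier in this thread suggest. All function names are hypothetical, and the download step itself is left out because the tool used is not stated here.

```python
def parse_clip_name(filename):
    """Split '<youtube_id>_<start>_<end>.mp4' into (id, start_s, end_s)."""
    stem = filename.rsplit("/", 1)[-1]
    if stem.endswith(".mp4"):
        stem = stem[:-4]
    # YouTube ids may contain '_' themselves, so split from the right.
    youtube_id, start, end = stem.rsplit("_", 2)
    return youtube_id, int(start), int(end)


def needs_redownload(duration_s, n_frames, min_duration_s=10.0, min_frames=300):
    """Apply the rule from step 1: too short in time OR too few frames."""
    return duration_s < min_duration_s or n_frames < min_frames


def redownload_plan(clips):
    """clips: iterable of (filename, duration_s, n_frames) tuples."""
    plan = []
    for filename, duration_s, n_frames in clips:
        if needs_redownload(duration_s, n_frames):
            youtube_id, start, end = parse_clip_name(filename)
            plan.append({
                "file": filename,
                "url": "https://www.youtube.com/watch?v=" + youtube_id,
                "start_s": start,
                "end_s": end,
            })
    return plan
```

Note that 300 frames corresponds to 10 seconds at 30 fps, so with these defaults a complete 10-second clip recorded at 25 fps (250 frames) would also be flagged. Both thresholds are parameters for that reason.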

ShoufaChen commented 2 years ago

I see. Thanks.

kinetics-cvdf commented 2 years ago

Thanks for identifying this issue. We have now replaced most of the videos shorter than 10 seconds (which were basically corrupted). There are still 3000 videos (5% of the entire test set) shorter than 9 seconds, either because the raw video on YouTube is short or because the video is no longer accessible. This is as it should be.