activitynet / ActivityNet

This repository is intended to host tools and demos for ActivityNet

size of Kinetics-600 #28

Open cmhungsteve opened 6 years ago

cmhungsteve commented 6 years ago

I am wondering how large Kinetics-600 is. I am downloading it now and have finished around 330 GB. I saw someone say Kinetics-400 is around 311 GB. Does that mean Kinetics-600 is around 470 GB? Just curious. Thank you.

chrischute commented 6 years ago

Training set is 604 GB (392k clips), downloaded with this improved script. Scaling by the number of clips, that would make validation ~46 GB (30k clips) and test ~92 GB (60k clips).

cmhungsteve commented 6 years ago

Thank you for the reply. What do you mean by "scaling by the number of clips"?

chrischute commented 6 years ago

No problem. I just meant that I had only downloaded the training set videos, so I was estimating the validation and test set sizes using the number of clips as given by the annotation files (i.e. multiplying by 30/392 to get the validation size, and 60/392 to get the test set size). I ended up downloading the validation clips and can confirm they're just over 46 GB total.
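For concreteness, the back-of-the-envelope arithmetic as a quick Python sketch (the 604 GB figure and clip counts come from the comments above):

```python
train_gb, train_clips = 604, 392_000        # measured training set size
val_est = train_gb * 30_000 / train_clips   # scale by the validation clip count
test_est = train_gb * 60_000 / train_clips  # scale by the test clip count
print(round(val_est), round(test_est))      # -> 46 92
```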

cmhungsteve commented 6 years ago

Got it. Thank you. Do you know the main difference between the improved script mentioned above and the one in your repo?

chrischute commented 6 years ago

Kinetics is made up of 10-second clips from full YouTube videos. The original script downloads the full video for each example, then extracts the 10-second clip once it's downloaded. The improved script by @jremmons only downloads the 10-second clip you need.

You can see the line changes here: https://github.com/activitynet/ActivityNet/pull/16
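The gist of the change, as a minimal sketch (not the actual PR code; the helper name, format selection, and stream-copy choice are illustrative):

```python
import subprocess

def download_clip(video_id, t_start, duration=10, out_path="clip.mp4"):
    """Fetch only the 10-second window instead of the whole YouTube video."""
    # Resolve the direct stream URL without downloading anything.
    url = subprocess.check_output(
        ["youtube-dl", "-f", "mp4", "-g",
         f"https://www.youtube.com/watch?v={video_id}"],
        text=True,
    ).strip().splitlines()[0]
    # Let ffmpeg seek into the remote stream and cut just the clip.
    subprocess.check_call([
        "ffmpeg", "-ss", str(t_start), "-i", url,
        "-t", str(duration), "-c", "copy", out_path,
    ])
```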

cmhungsteve commented 6 years ago

I see. Thank you so much.

escorciav commented 6 years ago

Note: I did not manage to get clean* videos with that script. It would be nice to see the effect of that on classification accuracy.

jremmons commented 6 years ago

@cmhungsteve I think I fixed the issue that @escorciav mentioned with my latest commits.

https://github.com/jremmons/ActivityNet/blob/master/Crawler/Kinetics/download.py

cmhungsteve commented 6 years ago

@jremmons Thank you!!

chrischute commented 6 years ago

@jremmons FWIW I sampled about 20 videos downloaded with the old script, and never saw the artifacts referred to in the other post (viewing in QuickTime Player). Any insight into why I might not have had the issue? Am I just getting lucky in sampling videos with no artifacts?

escorciav commented 6 years ago

@chrischute out of curiosity, did you take care to sample videos with t_start significantly different from zero? For the record, I used VLC or the Fedora 27 default video player, aided by non-free codecs such as x264.

chrischute commented 6 years ago

@escorciav I did try sampling a video with t_start greater than 0 (abseiling/YEgqBGmmPV8). There were no artifacts. I ran the download script on an Ubuntu machine with Python 3.6 and ffmpeg 3.1.3-static, downloaded the mp4 to my Mac, and viewed it in QuickTime Player.

jremmons commented 6 years ago

@escorciav I also didn't notice any issues with the first script I wrote. The current version of my script now re-encodes, as the original download.py did, just without downloading the entire YouTube video (this is just to be safe). If someone can provide an example of a video where this issue occurs, that would help a lot.

It would be a huge performance win for most people if the script didn't have to re-encode. If we can't reproduce this problem, it might be worth going back to a version that doesn't re-encode.
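For reference, a sketch of the tradeoff being discussed (the flag values are illustrative, not the script's actual arguments): stream-copy skips the expensive re-encode but can only cut on keyframes, while re-encoding gives frame-accurate, clean clips.

```python
import subprocess

def cut_clip(url, t_start, duration, out_path, reencode=True):
    cmd = ["ffmpeg", "-ss", str(t_start), "-i", url, "-t", str(duration)]
    if reencode:
        # Frame-accurate and always clean, but CPU-heavy: what the current
        # script does, "just to be safe".
        cmd += ["-c:v", "libx264", "-c:a", "aac", out_path]
    else:
        # Fast, but the cut snaps to the nearest keyframe, which might explain
        # artifacts on clips whose t_start is far from zero.
        cmd += ["-c", "copy", out_path]
    subprocess.check_call(cmd)
```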

escorciav commented 6 years ago

TL;DR: as we agreed in the other PR, it's great to have this alternative way to download and clip the videos. If anyone wants to get started quickly, please use it. Take my words as the kind of disclaimer note that comes in the agreements we usually don't read 😉

I am really happy with your contribution and comments. My original comment was more a scientific/engineering question than a note meant to discourage use of the script.

I don't have much bandwidth to test it in the next two weeks; I will try, but I can't promise anything. If you have a Docker image or conda environment, please share it; that would reduce the amount of work.

cmhungsteve commented 6 years ago

@jremmons I tried your script, but all the downloaded files are essentially empty. I also tried printing "status" at line 137 of the script, and it always showed something like `('QcVuxQAgrzU_000007_000017', False, b'')`. However, I have no problem downloading with the old script (I used the same command). Can you help me figure out what the problem is? Thank you.

jremmons commented 6 years ago

@cmhungsteve that is strange... I literally just used the same script today to download Kinetics-600 without any issues. Are you sure you are using the most up-to-date version here? If you have questions about my script, though, we should chat on #16.

cmhungsteve commented 6 years ago

I am not really sure why. Is it because my FFmpeg version is not correct, or because I am missing some library? Here is the output of ffmpeg -version:

```
ffmpeg version 3.4.2-1~16.04.york0.2 Copyright (c) 2000-2018 the FFmpeg developers
built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.9) 20160609
configuration: --prefix=/usr --extra-version='1~16.04.york0.2' --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
libavutil      55. 78.100 / 55. 78.100
libavcodec     57.107.100 / 57.107.100
libavformat    57. 83.100 / 57. 83.100
libavdevice    57. 10.100 / 57. 10.100
libavfilter     6.107.100 /  6.107.100
libavresample   3.  7.  0 /  3.  7.  0
libswscale      4.  8.100 /  4.  8.100
libswresample   2.  9.100 /  2.  9.100
libpostproc    54.  7.100 / 54.  7.100
```

and my youtube-dl version is "2018.6.25", which I think is the newest.

sophia-wright-blue commented 6 years ago

I know this is a rather open-ended question, but I was looking for some guidance on how long it takes to download the entire dataset (e.g., with num-jobs=24 on an AWS p2 instance with 8 cores). Thank you, @cmhungsteve @jremmons

okankop commented 5 years ago

> @cmhungsteve that is strange... I literally just used the same script today to download Kinetics-600 without any issues. Are you sure you are using the most up-to-date version here? If you have questions about my script, though, we should chat on #16.

Did you manage to download all the clips? When I try to download the dataset, around 10% of the clips cannot be downloaded because the video is unavailable, there are copyright issues, or the user closed the account. Is that normal?

cmhungsteve commented 5 years ago

@okankop yeah... there are lots of videos with copyright issues. I think it's normal.

MannyKayy commented 5 years ago

An update on the stats using @jremmons's version of the download script:

Training set: 589 GB (380,802 clips)
Validation set: 45 GB (29,097 clips)
Test set: 19 GB (12,617 clips)

dandelin commented 5 years ago

While inspecting the downloaded videos, I found that joblib's parallelism would disrupt ffmpeg's transcoding (when reading from a URL stream) and yield corrupted videos. The problem was solved by replacing joblib with Python's built-in multiprocessing module.
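A minimal sketch of that swap, assuming a per-clip worker like the one in download.py (the worker name and tuple layout are assumptions):

```python
from multiprocessing import Pool

def download_clip(video_id, t_start, t_end):
    """Placeholder for the per-clip worker defined in download.py."""

def download_all(clips, num_jobs=24):
    # clips: list of (video_id, t_start, t_end) tuples from the annotation CSV.
    # Each worker gets its own process, so concurrent ffmpeg transcodes no
    # longer interfere the way they did under joblib.
    with Pool(processes=num_jobs) as pool:
        return pool.starmap(download_clip, clips)
```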

lesoleil commented 5 years ago

> An update on the stats using @jremmons's version of the download script:
>
> Training set: 589 GB (380,802 clips)
> Validation set: 45 GB (29,097 clips)
> Test set: 19 GB (12,617 clips)

Hi, can you share the Kinetics-600 data files? Thanks a lot! @MannyKayy

xiaoyang-coder commented 5 years ago

Hi, can you share the Kinetics-600 data files? Thanks a lot! @MannyKayy

MannyKayy commented 5 years ago

What is the recommended way of sharing this dataset? It's roughly 0.6 TB, and I am having trouble making a torrent of it.

escorciav commented 5 years ago

Maybe we should reach out to the CVDF foundation. They could probably host the data, as they have done for other datasets.

Please thumbs-up this message if you deem it essential; it would help to make a strong case.

hollowgalaxy commented 5 years ago

@sophia-wright-blue were you able to make it run on the p2 instance? I was running into a "too many requests" issue, for which I created https://github.com/activitynet/ActivityNet/issues/51#issue-471902064. @cmhungsteve @jremmons It seems you did not have issues with sending too many requests from youtube-dl's side. Was the machine you used for downloading a personal one or a server?

sophia-wright-blue commented 5 years ago

Hi @hollowgalaxy , I was able to download it on an Azure VM, good luck!

hollowgalaxy commented 5 years ago

Thanks for letting me know, @sophia-wright-blue, I just tried it. I ran a Standard D4s v3 (4 vCPUs, 16 GiB memory) and it did not work.

sophia-wright-blue commented 5 years ago

@hollowgalaxy, I think I played around with that number a bit; I don't remember what finally worked. You might also want to try this repo, which worked for me: https://github.com/Showmax/kinetics-downloader

MahdiKalayeh commented 5 years ago

Apparently, YouTube has recently started to extensively block large-scale downloading via youtube-dl. I have tried using the Crawler code for Kinetics and always get an HTTP 429 error. So it does not matter which approach or code you use; YouTube apparently just does not allow systematic downloading. It would be great if ActivityNet hosted the videos on some server so researchers would still be able to use Kinetics.

escorciav commented 5 years ago

@MahdiKalayeh Could you please confirm whether #51 is the same error message that you got?

MahdiKalayeh commented 5 years ago

@escorciav Yes, the same error.

escorciav commented 5 years ago

Let's track it there. Thanks 😉

MStumpp commented 5 years ago

@MannyKayy were you able to upload your download of Kinetics-600 somewhere so we can download it from there? Thanks.

MannyKayy commented 5 years ago

@MStumpp Unfortunately not. @escorciav It may be worth it if the CVDF reaches out to the authors for a copy of the full original Kinetics dataset.

escorciav commented 5 years ago

I contacted the Kinetics maintainers, and they are aware of the request. The ball is in their court. I will follow up with them by the end of the week.

sailordiary commented 5 years ago

> I contacted the Kinetics maintainers, and they are aware of the request. The ball is in their court. I will follow up with them by the end of the week.

It's been a month, so I guess that possibility has gone out the window by now...?

tyyyang commented 5 years ago

Any updates on this?

escorciav commented 5 years ago

https://github.com/activitynet/ActivityNet/issues/28#issuecomment-549141529

Regarding ☝️, I haven't heard back from them officially. My feeling is that the maintainers have knocked on multiple doors and have not found a solution yet.

The most viable solutions that I'm aware of are:

0xMarsRover commented 4 years ago

I would like to ask some questions (they might repeat someone else's).

1. Currently, I downloaded the "val" set from the Kinetics-600 dataset and got 28k clips (3.9 GB). Is that correct?

2. I got an error at the final step (saving download_report.json):

   ```
   Traceback (most recent call last):
     File "download.py", line 220, in <module>
       main(**vars(p.parse_args()))
     File "download.py", line 200, in main
       fobj.write(json.dumps(status_lst))
     File "/usr/lib/python3.5/json/__init__.py", line 230, in dumps
       return _default_encoder.encode(obj)
     File "/usr/lib/python3.5/json/encoder.py", line 198, in encode
       chunks = self.iterencode(o, _one_shot=True)
     File "/usr/lib/python3.5/json/encoder.py", line 256, in iterencode
       return _iterencode(o, 0)
     File "/usr/lib/python3.5/json/encoder.py", line 179, in default
       raise TypeError(repr(o) + " is not JSON serializable")
   TypeError: b'' is not JSON serializable
   ```

   What is this report? Is it very important? (A possible fix is sketched after this list.)

3. I got an error when downloading the "test" split: `FileNotFoundError: [Errno 2] No such file or directory: '3c4ab9ca-5eb6-4525-8e4d-ac4111536577.mp4.part'`. Has anyone seen a similar one?

Thanks in advance.
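Regarding question 2, a likely fix as a sketch (assuming `status_lst` holds `(clip_id, downloaded, log)` tuples as printed earlier in this thread): json.dumps cannot encode bytes, so decode the log field before writing the report.

```python
import json

def save_report(status_lst, path="download_report.json"):
    # Entries look like ('QcVuxQAgrzU_000007_000017', False, b''); decode the
    # bytes log so the list becomes JSON-serializable.
    clean = [
        (clip_id, ok, log.decode("utf-8", "replace") if isinstance(log, bytes) else log)
        for clip_id, ok, log in status_lst
    ]
    with open(path, "w") as fobj:
        fobj.write(json.dumps(clean))
```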

mahsunaltin commented 4 years ago

I am also working with the Kinetics dataset for academic purposes, and I got the same error (429). Can you please share the data with us via Google Drive or something else? @kaiqiangh

0xMarsRover commented 4 years ago

Hi @mahsunaltin, I also have this issue and cannot download the whole dataset. Not sure how to solve it.

mahsunaltin commented 4 years ago

> Currently, I downloaded the "val" set from the Kinetics-600 dataset and got 28k clips (3.9 GB). Is that correct?

When you wrote that, I thought you had already downloaded the whole val set. @kaiqiangh

0xMarsRover commented 4 years ago

Hi, I checked the log files and found some errors that led to the incomplete val set. I then re-ran the code, and the val set was overwritten; I am still working on it. By the way, I tried to download videos from another server of mine, but I still get the 429 error. Do you have any solution for that?

mahsunaltin commented 4 years ago

I have already tried various techniques to download the dataset, and like everyone else I got the 429 error. In fact, if we could change the IP address every 50 videos, there would be no problem. So I have a somewhat tricky way to download the dataset using Colab. (I know it's not very elegant :) but so far so good.)

AmeenAli commented 4 years ago

Regarding ActivityNet: it was published, here you go: https://drive.google.com/file/d/12YOTnPc4zCwum_R9CSpZAI9ppAei8KMG/view

KT27-A commented 4 years ago

If anyone cannot download samples because of error 429, you can use --cookies to download them. See https://daveparrish.net/posts/2018-06-22-How-to-download-private-YouTube-videos-with-youtube-dl.html.
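A minimal sketch of that workaround (cookies.txt is a placeholder for a Netscape-format cookie file exported from a logged-in browser session, as the linked post describes; the video ID is just an example from earlier in this thread):

```python
import subprocess

# Pass the exported cookie file so requests look like they come from a
# logged-in account instead of an anonymous crawler.
subprocess.check_call([
    "youtube-dl",
    "--cookies", "cookies.txt",
    "-f", "mp4",
    "-o", "%(id)s.%(ext)s",
    "https://www.youtube.com/watch?v=QcVuxQAgrzU",
])
```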

BTW, it seems that a lot of videos are private and cannot be accessed. How can we download the private videos to make the dataset complete?

eglerean commented 4 years ago

Thanks @Katou2! Do you mean that some of those videos are private, i.e., cannot be accessed by anybody except the uploader (and YouTube, of course)?