cmhungsteve opened this issue 6 years ago
Training set is 604 GB (392k clips), downloaded with this improved script. Scaling by the number of clips, that would make validation ~46 GB (30k clips) and test ~92 GB (60k clips).
Thank you for the reply. What do you mean by "scaling by the number of clips"?
No problem. I just meant that I had only downloaded the training set videos, so I was estimating the validation and test set sizes using the number of clips as given by the annotation files (i.e. multiplying by 30/392 to get the validation size, and 60/392 to get the test set size). I ended up downloading the validation clips and can confirm they're just over 46 GB total.
Got it. Thank you. Do you know the main difference between the improved script mentioned above and the one in your repo?
Kinetics is made up of 10-second clips from full YouTube videos. The original script downloads the full video for each example, then extracts the 10-second clip once it's downloaded. The improved script by @jremmons only downloads the 10-second clip you need.
You can see the line changes here: https://github.com/activitynet/ActivityNet/pull/16
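For reference, here is a minimal sketch of the clip-only idea (not the exact code from the PR; the function name and the `-f best` format choice are illustrative, and youtube-dl and ffmpeg are assumed to be on the PATH):

```python
import subprocess

def download_clip(video_id, t_start, duration, output_path):
    # Ask youtube-dl for a direct stream URL without downloading the video.
    url = subprocess.check_output(
        ['youtube-dl', '-f', 'best', '--get-url',
         'https://www.youtube.com/watch?v=' + video_id]
    ).decode('utf-8').strip()

    # Let ffmpeg seek into the remote stream and fetch only the segment
    # we need, instead of the whole video.
    subprocess.check_call(
        ['ffmpeg', '-ss', str(t_start), '-i', url,
         '-t', str(duration), '-c:v', 'libx264', '-c:a', 'copy',
         output_path])

# e.g. download_clip('YEgqBGmmPV8', 0, 10, 'abseiling_YEgqBGmmPV8.mp4')
```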
I see. Thank you so much.
Note: I did not manage to get clean (artifact-free) videos with that script. It would be nice to see the effect of that on classification accuracy.
@cmhungsteve I think I fixed the issue that @escorciav mentioned with my latest commits.
https://github.com/jremmons/ActivityNet/blob/master/Crawler/Kinetics/download.py
@jremmons Thank you!!
@jremmons FWIW I sampled about 20 videos downloaded with the old script, and never saw the artifacts referred to in the other post (viewing in QuickTime Player). Any insight into why I might not have had the issue? Am I just getting lucky in sampling videos with no artifacts?
@chrischute out of curiosity, did you take care to sample videos with t_start significantly different from zero? For the record, I used VLC or the Fedora 27 default video player, aided with non-free codecs such as x264.
@escorciav I did try sampling a video with t_start greater than 0 (abseiling/YEgqBGmmPV8). There were no artifacts. I ran the download script on an Ubuntu machine with Python 3.6 and ffmpeg 3.1.3-static. I downloaded the mp4 to my Mac and viewed it in QuickTime Player.
@escorciav I also didn't notice any issues with the first script I wrote. The current version of my script now re-encodes, like the original download.py did, just without downloading the entire YouTube video (this is just to be safe). If someone can provide an example of a video where this issue occurs, that would help a lot.
It would be a huge performance win for most people if the script doesn't have to re-encode. If we can't reproduce this problem it might be worth going back to a version that doesn't do re-encoding.
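To make the trade-off concrete, here is a hedged sketch of the two cuts (the helper and its arguments are illustrative): stream copy skips re-encoding entirely, but it can only cut at keyframes, which is one plausible source of the artifacts discussed above.

```python
import subprocess

def cut_clip(url, t_start, duration, output_path, reencode=True):
    # Stream copy ('-c copy') is much faster but can only cut at
    # keyframes, so the clip may start slightly off and some players
    # show artifacts. Re-encoding is slower but frame-accurate.
    codec = ['-c:v', 'libx264', '-c:a', 'copy'] if reencode else ['-c', 'copy']
    subprocess.check_call(
        ['ffmpeg', '-ss', str(t_start), '-i', url,
         '-t', str(duration)] + codec + [output_path])
```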
TL;DR: as we agreed in the other PR, it's great to have this alternative for downloading and clipping the videos. If anyone wants to start quickly, please use it. Take my words as the kind of disclaimer note that comes in the agreements we usually don't read 😉
I am really happy with your contribution and comments. My original comment was more of a scientific/engineering question than a note to discourage the usage of this script.
I don't have much bandwidth to test it in the next two weeks; I will try, but I can't promise anything. If you have a Docker image or conda environment, please share it; that would reduce the amount of work.
@jremmons I tried your script, but all the downloaded files are just empty text files. I also tried printing "status" at line 137, and it always showed something like this: ('QcVuxQAgrzU_000007_000017', False, b''). However, I have no problem downloading with the old script (I used the same command). Can you help me figure out what the problem is? Thank you.
@cmhungsteve that is strange... I literally just used the same script today to download kinetics-600 without any issues. Are you sure you are using the most up to date version here? If you have questions about my script though, we should chat on #16.
I am not really sure why. Is it because my FFmpeg version is not correct, or am I missing some library?
Here is the info shown by ffmpeg -version:
```
ffmpeg version 3.4.2-1~16.04.york0.2 Copyright (c) 2000-2018 the FFmpeg developers
built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.9) 20160609
configuration: --prefix=/usr --extra-version='1~16.04.york0.2' --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
libavutil      55. 78.100 / 55. 78.100
libavcodec     57.107.100 / 57.107.100
libavformat    57. 83.100 / 57. 83.100
libavdevice    57. 10.100 / 57. 10.100
libavfilter     6.107.100 /  6.107.100
libavresample   3.  7.  0 /  3.  7.  0
libswscale      4.  8.100 /  4.  8.100
libswresample   2.  9.100 /  2.  9.100
libpostproc    54.  7.100 / 54.  7.100
```
and my youtube-dl version is 2018.6.25, which I think is the latest.
I know this is a rather open-ended question, but I was hoping to get some guidance on how long it takes to download the entire dataset (e.g. with num-jobs=24 on an AWS p2 instance with 8 cores). Thank you, @cmhungsteve @jremmons
Did you manage to download all the clips? When I try to download the dataset, around 10% of the clips cannot be downloaded because the video is unavailable, has copyright issues, or the user closed the account. Is that normal?
@okankop yeah....there are lots of videos with copyright issues. I think it's normal.
An update on the stats using @jremmons's version of the download script.
Training set: 589 GB (380,802 clips)
Validation set: 45 GB (29,097 clips)
Test set: 19 GB (12,617 clips)
While inspecting the downloaded videos, I found that joblib's parallelism would interfere with ffmpeg's transcoding (from the URL stream) and yield corrupted videos. The problem was solved by replacing joblib with Python's built-in multiprocessing module.
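For anyone hitting the same corruption, a minimal sketch of the swap; download_clip_wrapper stands in for the real per-clip worker from the script, so the names here are illustrative:

```python
from functools import partial
from multiprocessing import Pool

def download_clip_wrapper(row, output_dir):
    # Stand-in for the per-clip worker in download.py; it should
    # download one clip and return a (clip_id, success, log) tuple.
    clip_id, t_start, t_end = row
    return (clip_id, True, '')

def download_all(rows, output_dir, num_jobs=24):
    # One process per worker: concurrent ffmpeg transcodes no longer
    # share state, which avoided the corrupted outputs seen with joblib.
    worker = partial(download_clip_wrapper, output_dir=output_dir)
    with Pool(processes=num_jobs) as pool:
        return pool.map(worker, rows)
```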
Hi, can you share the Kinetics-600 data files? Thanks a lot! @MannyKayy
What is the recommended way of sharing this dataset? It's roughly 0.6 TB and I am having trouble making a torrent of this dataset.
Maybe we should reach out to the CVDF foundation. They could probably host the data, as they have done for other datasets.
Please thumbs-up this message if you deem it essential; it would help to make a strong case.
@sophia-wright-blue were you able to make it run on the p2 instance? I was running into a too-many-requests issue, which I created an issue for: https://github.com/activitynet/ActivityNet/issues/51#issue-471902064 @cmhungsteve @jremmons It seems you did not have issues with sending too many requests from youtube-dl's side. Was the machine you used for downloading a personal machine or a server?
Hi @hollowgalaxy , I was able to download it on an Azure VM, good luck!
Thanks for letting me know, @sophia-wright-blue, I just tried it. I ran a Standard D4s v3 (4 vCPUs, 16 GiB memory) and it did not work.
@hollowgalaxy, I think I played around with that number a bit; I don't remember what finally worked. You might also want to try this repo, which worked for me: https://github.com/Showmax/kinetics-downloader
Apparently, YouTube has recently started to extensively block large-scale downloading via youtube-dl. I have tried the Crawler code for Kinetics and always get an HTTP 429 error. So it does not matter which approach/code you use; YouTube simply does not allow systematic downloading. It would be great if ActivityNet hosted the videos on some server so researchers could still use Kinetics.
@MahdiKalayeh Could you please confirm whether #51 shows the same error message that you got?
@escorciav Yes, the same error.
Let's track it there. Thanks 😉
@MannyKayy were you able to upload your download of Kinetics-600 so we can download it from there? Thanks.
I contacted the Kinetics maintainers, and they are aware of the request. The ball is in their court. I will follow up with them by the end of the week.
It's been a month so I guess that possibility's gone out of the window by now...?
Any update on this track?
https://github.com/activitynet/ActivityNet/issues/28#issuecomment-549141529
Regarding ☝️ , I haven't heard back from them officially. My feeling is that the maintainers have knocked on multiple doors and have not found any solution yet.
The most viable solutions that I'm aware of are:
I would ask some questions (some might repeat earlier ones).
I have currently downloaded the "val" set of the Kinetics-600 dataset and got 28k clips (3.9 GB). Is that correct?
I got an error at the final step (saving download_report.json):

```
Traceback (most recent call last):
  File "download.py", line 220, in <module>
    main(**vars(p.parse_args()))
  File "download.py", line 200, in main
    fobj.write(json.dumps(status_lst))
  File "/usr/lib/python3.5/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.5/json/encoder.py", line 198, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.5/json/encoder.py", line 256, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.5/json/encoder.py", line 179, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'' is not JSON serializable
```

What is this report? Is it very important? (A possible fix is sketched after this comment.)
I got an error when downloading the "test" split:

```
FileNotFoundError: [Errno 2] No such file or directory: '3c4ab9ca-5eb6-4525-8e4d-ac4111536577.mp4.part'
```

Did anyone get a similar one?
Thanks in advance.
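If it helps, one plausible fix for the download_report.json error above: the status tuples carry raw subprocess output as bytes (the b'' in the traceback), which Python's json module refuses to encode. A sketch, with illustrative names:

```python
import json

def save_report(status_lst, path='download_report.json'):
    # Decode any bytes fields (e.g. captured ffmpeg/youtube-dl output)
    # so the status tuples become JSON serializable.
    cleaned = [
        tuple(x.decode('utf-8', errors='replace') if isinstance(x, bytes) else x
              for x in status)
        for status in status_lst
    ]
    with open(path, 'w') as fobj:
        fobj.write(json.dumps(cleaned))

# The failing value from the traceback above now serializes cleanly.
save_report([('QcVuxQAgrzU_000007_000017', False, b'')])
```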
I am also working with the Kinetics dataset for academic purposes, and I got the same error (429). Can you please share the data with us via Google Drive or something else? @kaiqiangh
Hi @mahsunaltin, I also have this issue and cannot download the whole dataset. Not sure how to solve it.
When you wrote that you had downloaded the "val" set of Kinetics-600 and got 28k clips (3.9 GB), I thought you had already downloaded the whole val set. @kaiqiangh
Hi, I checked the log files and found some errors that led to the incompleteness of the val set. I then re-ran the code, and the val set was overwritten. I am still working on it. By the way, I tried to download videos from another server of mine, but I still get the 429 error. Do you have any solution for that?
I have already tried all kinds of techniques to download the dataset, and like everyone else I got the 429 error. In fact, if we could change the IP address every 50 videos, there would be no problem. So I have a somewhat tricky way to download the dataset using Colab. (I know it's not very elegant :) but so far so good.)
Regarding ActivityNet: it was published, here you go: https://drive.google.com/file/d/12YOTnPc4zCwum_R9CSpZAI9ppAei8KMG/view
If anyone cannot download samples due to error 429, you can use --cookies to download them. See https://daveparrish.net/posts/2018-06-22-How-to-download-private-YouTube-videos-with-youtube-dl.html.
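For example, a sketch of wiring --cookies into a youtube-dl call (assuming you exported cookies.txt from a logged-in browser session as described at that link; the video id and output template are illustrative):

```python
import subprocess

# Pass the exported cookies.txt to youtube-dl so requests reuse an
# authenticated session instead of hitting the anonymous rate limit.
subprocess.check_call([
    'youtube-dl',
    '--cookies', 'cookies.txt',
    '-f', 'best',
    '-o', '%(id)s.%(ext)s',
    'https://www.youtube.com/watch?v=YEgqBGmmPV8',
])
```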
BTW, it seems that a lot of videos are private and cannot be accessed. How can we download the private videos to make the dataset complete?
Thanks @Katou2! Do you mean that some of those videos are private -> cannot be accessed by anybody except the uploader (and youtube of course)?
I am wondering how large Kinetics-600 is. I am downloading it now and have finished around 330 GB. I saw someone say Kinetics-400 is around 311 GB. Does that mean Kinetics-600 is around 470 GB? Just curious. Thank you.