coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

Download unlimited number of youtube channels or playlists should not be the default #464

Open jcline-ieee opened 6 years ago

jcline-ieee commented 6 years ago

Because edx-dl passes no sensible default limit to youtube-dl, it downloads huge playlists or entire channels whenever a professor has innocently linked to a playlist or channel.

Your environment

Steps to reproduce

Download a course which includes a link to a youtube channel or playlist.

A couple of courses include links to external materials on YouTube. If these links point to a playlist or a channel, edx-dl can end up downloading huge amounts of unwanted video that has nothing to do with the course.

Another course apparently linked to a YouTube channel or playlist in order to credit the original YouTube author of share-alike music used in the course, and the link resolved to a huge playlist of videos. Because this attribution appeared several times in the course, the amount of junk downloaded was ridiculous: youtube-dl was downloading every share-alike music video from that external channel.

A game design course linked to a gamer's playthrough of a specific game, but the link went to a playlist of hundreds of large unrelated videos as well.

Expected behaviour

edx-dl should not recurse into massive YouTube playlists. By default it should download only the first video from a YouTube channel link, and a specific option should allow overriding that if desired.
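
A minimal sketch of that behaviour, not edx-dl's actual code. It assumes the youtube-dl arguments are assembled as a list, as with `YOUTUBE_DL_CMD` in `edx_dl/common.py`. The names `DEFAULT_LIMIT`, `LIMIT_FLAGS` and `build_youtube_dl_cmd` are hypothetical. It uses youtube-dl's `--playlist-end` rather than `--no-playlist` because, as far as I know, `--no-playlist` only applies when a URL refers to both a single video and a playlist, so it would not limit a plain channel link.

```python
import shlex

# Base command as it appears in edx_dl/common.py before the patch.
YOUTUBE_DL_CMD = ['youtube-dl', '--ignore-config']

# Hypothetical default: stop after the first entry of any playlist or channel.
DEFAULT_LIMIT = ['--playlist-end', '1']

# youtube-dl flags that, when supplied by the user, replace the default limit.
LIMIT_FLAGS = {'--playlist-end', '--playlist-items', '--max-downloads'}


def build_youtube_dl_cmd(url, extra_options=''):
    """Return the argument list for one youtube-dl call (illustrative only)."""
    user_opts = shlex.split(extra_options)
    # Accept both "--flag value" and "--flag=value" spellings.
    user_sets_limit = any(opt.split('=', 1)[0] in LIMIT_FLAGS
                          for opt in user_opts)
    limit = [] if user_sets_limit else DEFAULT_LIMIT
    return YOUTUBE_DL_CMD + limit + user_opts + [url]
```

With no extra options the channel link is capped at one video. Passing something like `--max-downloads=2` through `--youtube-dl-options` drops the built-in cap in favour of the user's choice.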

Actual behaviour

Downloads complete rubbish channels or playlists.

The behaviour below is undesirable and unexpected: by default, edx-dl downloads 671 videos of content unrelated to the course, and then downloads them again in the following course week, and again after that.

[skipping] https://courses.edx.org/courses/course-v1:HarvardX+SPU30x+3T2016/xblock/block-v1:HarvardX+SPU30x+3T2016+type@video+block@ad20679adbe544a19e5be0378566bac7/handler/transcript/translation/en => Downloaded/Super-Earths_and_Life/05-The_Search_for_Life/04-HARSPU30T214-G000700_100.en.srt
[skipping] https://d2f1egay8yehza.cloudfront.net/har-spu30/HARSPU30T214-V003200_DTH.mp4 => Downloaded/Super-Earths_and_Life/06-Wrap-Up/01-HARSPU30T214-V003200_DTH.mp4
[skipping] https://courses.edx.org/courses/course-v1:HarvardX+SPU30x+3T2016/xblock/block-v1:HarvardX+SPU30x+3T2016+type@video+block@1380b48b550b441184c488597b824fe8/handler/transcript/translation/en => Downloaded/Super-Earths_and_Life/06-Wrap-Up/01-HARSPU30T214-V003200_DTH.en.srt
[download] https://www.youtube.com/channel/UCX6b17PVsYBQ0ip5gyeme-Q => Downloaded/Super-Earths_and_Life/06-Wrap-Up/02-%(title)s-%(id)s.%(ext)s
Downloading video with URL https://www.youtube.com/channel/UCX6b17PVsYBQ0ip5gyeme-Q from YouTube.
[youtube:channel] UCX6b17PVsYBQ0ip5gyeme-Q: Downloading channel page
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading webpage
[download] Downloading playlist: Uploads from CrashCourse
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading page #1
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading page #2
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading page #3
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading page #4
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading page #5
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading page #6
[youtube:playlist] UUX6b17PVsYBQ0ip5gyeme-Q: Downloading page #7
[youtube:playlist] playlist Uploads from CrashCourse: Downloading 671 videos
[download] Downloading video 1 of 671
.....
[download] 100% of 72.18MiB
[download] Downloading video 23 of 671

Patch

This worked for me. Edit: sorry, my patch quoting below is not working. It is so obvious, though, that I'm sure the idea is clear.


*** edx_dl/common.py    2016-04-19 08:24:12.000000000 -0700
--- edx_dl-20170521/common.py   2017-11-21 12:49:28.000000000 -0800
*************** class ExitCode(object):
*** 170,176 ****
      NO_DOWNLOADABLE_VIDEO = 6

! YOUTUBE_DL_CMD = ['youtube-dl', '--ignore-config']
  DEFAULT_CACHE_FILENAME = 'edx-dl.cache'
  DEFAULT_FILE_FORMATS = ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt',
                          'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg',
--- 170,176 ----
      NO_DOWNLOADABLE_VIDEO = 6

! YOUTUBE_DL_CMD = ['youtube-dl', '--ignore-config', '--no-playlist', '--no-check-certificate']
  DEFAULT_CACHE_FILENAME = 'edx-dl.cache'
  DEFAULT_FILE_FORMATS = ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt',
                          'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg',

*** edx_dl/edx_dl.py    2017-11-26 01:39:48.000000000 -0800
--- edx_dl-20170521/edx_dl.py   2017-11-21 12:49:28.000000000 -0800
*************** def parse_args():
*** 313,319 ****
      parser.add_argument('--youtube-dl-options',
                          dest='youtube_dl_options',
                          action='store',
!                         default='',
                          help='set extra options to pass to youtube-dl')

      parser.add_argument('--prefer-cdn-videos',
--- 315,321 ----
      parser.add_argument('--youtube-dl-options',
                          dest='youtube_dl_options',
                          action='store',
!                         default='--max-downloads=2',  
                          help='set extra options to pass to youtube-dl')

      parser.add_argument('--prefer-cdn-videos',

balta2ar commented 6 years ago

I vaguely remember there were issues with --no-playlist option, but I can't recollect the details... Alternatively you could do this: https://github.com/coursera-dl/edx-dl/issues/285#issuecomment-133485413

jcline-ieee commented 6 years ago

--max-downloads=2 is what makes it work well for me.

jcline-ieee commented 6 years ago

The whole problem of downloading YouTube channels could be greatly reduced if the YouTube links were scanned for duplicates. The following course reproduces the problem: course-v1:MITx+11.126x_2+1T2016. It includes a link to the same YouTube channel every week ("Downloading 684 videos"). Not only is the playlist huge, it is also re-downloaded into every subdirectory. This course is part of a course series, and each course in the series ends up attempting to download this huge channel in every section.

[download] Destination: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-So You Want To Be a Game Designer - Career Advice for Making Games - Extra Credits-zQvWMdWhFCc.mp4
[download] 100% of 14.47MiB in 01:50
[download] https://www.youtube.com/user/ExtraCreditz => edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-%(title)s-%(id)s.%(ext)s
Downloading video with URL https://www.youtube.com/user/ExtraCreditz from YouTube.
[youtube:user] ExtraCreditz: Downloading channel page
[youtube:playlist] UUCODtTcd5M1JavPCOr_Uydg: Downloading webpage
[download] Downloading playlist: Uploads from Extra Credits
[youtube:playlist] UUCODtTcd5M1JavPCOr_Uydg: Downloading page #1
[youtube:playlist] UUCODtTcd5M1JavPCOr_Uydg: Downloading page #2
[youtube:playlist] UUCODtTcd5M1JavPCOr_Uydg: Downloading page #3
[youtube:playlist] UUCODtTcd5M1JavPCOr_Uydg: Downloading page #4
[youtube:playlist] UUCODtTcd5M1JavPCOr_Uydg: Downloading page #5
[youtube:playlist] UUCODtTcd5M1JavPCOr_Uydg: Downloading page #6
[youtube:playlist] playlist Uploads from Extra Credits: Downloading 684 videos
[download] Downloading video 1 of 684
[youtube] BtY3Lto_lR4: Downloading webpage
[youtube] BtY3Lto_lR4: Downloading video info webpage
[youtube] BtY3Lto_lR4: Extracting video information
[youtube] BtY3Lto_lR4: Downloading MPD manifest
[info] Writing video description to: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-The Warhammer License (Again) - Did Games Workshop's Gamble Work - Extra Credits-BtY3Lto_lR4.description
[info] Writing video subtitles to: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-The Warhammer License (Again) - Did Games Workshop's Gamble Work - Extra Credits-BtY3Lto_lR4.es-419.vtt
[info] Writing video description metadata as JSON to: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-The Warhammer License (Again) - Did Games Workshop's Gamble Work - Extra Credits-BtY3Lto_lR4.info.json
[download] Destination: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-The Warhammer License (Again) - Did Games Workshop's Gamble Work - Extra Credits-BtY3Lto_lR4.mp4
[download] 100% of 15.45MiB in 01:58
[download] Downloading video 2 of 684
[youtube] a6sLFt6Fro4: Downloading webpage
[youtube] a6sLFt6Fro4: Downloading video info webpage
[youtube] a6sLFt6Fro4: Extracting video information
WARNING: video doesn't have subtitles
[youtube] a6sLFt6Fro4: Downloading MPD manifest
[info] Writing video description to: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-Frankenstein - Paradise Lost - Extra Sci Fi - #5-a6sLFt6Fro4.description
[info] Writing video description metadata as JSON to: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-Frankenstein - Paradise Lost - Extra Sci Fi - #5-a6sLFt6Fro4.info.json
[download] Destination: edx/Introduction_to_Game_Design/03-Week_1-_What_Are_Games/08-Frankenstein - Paradise Lost - Extra Sci Fi - #5-a6sLFt6Fro4.mp4
[download]  12.4% of 23.35MiB at 176.53KiB/s ETA 01:58
...
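
The duplicate scan suggested in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the function names are hypothetical, the URL test is a simple heuristic based on the `/channel/`, `/user/` and `/playlist` paths seen in the logs, and edx-dl is assumed to hand over the course's links as a plain list in course order.

```python
from urllib.parse import urlparse

# Hosts treated as YouTube for this heuristic (assumption, not exhaustive).
YOUTUBE_HOSTS = {'youtube.com', 'www.youtube.com', 'm.youtube.com'}


def is_youtube_collection(url):
    """Heuristic: True for channel, user, or playlist links, not single videos."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in YOUTUBE_HOSTS:
        return False
    return parsed.path.startswith(('/channel/', '/user/', '/playlist'))


def drop_repeated_collections(urls):
    """Keep the first occurrence of each channel/playlist link, drop repeats.

    Links to single videos and non-YouTube files pass through untouched.
    """
    seen = set()
    kept = []
    for url in urls:
        if is_youtube_collection(url):
            key = url.rstrip('/')  # treat a trailing slash as the same link
            if key in seen:
                continue
            seen.add(key)
        kept.append(url)
    return kept
```

Applied to a whole course, or to a whole course series, this would fetch a channel such as https://www.youtube.com/user/ExtraCreditz at most once instead of once per section. It does not shrink the channel itself, so it would still need to be combined with a download limit.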

salahoued commented 6 years ago

Hi guys, today I came across this problem when trying to download this course [https://courses.edx.org/courses/course-v1:MITx+6.00.1x+2T2017_2/course/]: it contained some YouTube channel links, so youtube-dl started downloading the channels. As a workaround, I used the --export-filename option, removed the YouTube channel links manually, and passed the file to aria2c with -i -x16. The only problem now is the file names: I can't tell which file is the first, second, etc. without playing each one.

Does anyone have a solution? Thanks!
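
One possible approach to the naming problem, sketched under assumptions: the exported file is assumed to contain one URL per line in course order, and the function name `to_aria2_input` is hypothetical. aria2c's input-file format allows per-URI options on indented lines following the URI, so an `out=` line with a running number can keep the original order visible in the file names.

```python
import posixpath
from urllib.parse import urlparse

# Hosts whose links are skipped entirely (assumption, not exhaustive).
YOUTUBE_HOSTS = {'youtube.com', 'www.youtube.com', 'm.youtube.com', 'youtu.be'}


def to_aria2_input(urls):
    """Build aria2c input-file text with numbered output names.

    YouTube links are dropped; every remaining URL gets an indented
    "out=" option line so the file name starts with its position.
    """
    lines = []
    index = 0
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc.lower() in YOUTUBE_HOSTS:
            continue
        index += 1
        name = posixpath.basename(parsed.path) or 'file'
        lines.append(url)
        lines.append('  out={:03d}-{}'.format(index, name))
    return '\n'.join(lines) + '\n'
```

Writing the returned text to a file and passing that file to aria2c with -i -x16 should then produce names such as 001-HARSPU30T214-V003200_DTH.mp4, so the files sort in the order they appeared in the export.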