coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

edx-dl not able to download videos from edx platform #559

Closed MATRIX30 closed 4 years ago

MATRIX30 commented 5 years ago


Subject of the issue

edx-dl fails to extract and download videos for "https://courses.edx.org/courses/course-v1:EdinburghX+PA1.1x+3T2019/course/" on www.edx.org. It seems the videos for this course are sourced from "https://media.ed.ac.uk/" and not YouTube. I need help resolving this issue.

Your environment

Steps to reproduce

--- create an account on edX

--- enroll in the course "https://courses.edx.org/courses/course-v1:EdinburghX+PA1.1x+3T2019/course/"

--- type the following into CMD
edx-dl -u username -p password -o path --ignore-errors --cache https://courses.edx.org/courses/course-v1:EdinburghX+PA1.1x+3T2019/course/

Expected behaviour

The download starts normally.

Actual behaviour

edx_dl version 0.1.10
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Introduction to Predictive Analytics [course-v1:EdinburghX+PA1.1x+3T2019/co]
Downloading 0 section(s)
loading 2329 urls from cache [edx-dl.cache]
Extracting all units information in parallel.
No downloadable video found.

YukunXia commented 5 years ago

Having the same issue :(

mor3dr3ad commented 5 years ago

Confirmed with different url: https://courses.edx.org/courses/course-v1:MITx+14.750x+3T2019/course/

Output of --debug:

root[main] edx_dl version 0.1.10
root[parse_file_formats] file_formats: ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg', 'zip', 'rar', 'gz', 'mp3', 'R', 'Rmd', 'ipynb', 'py']
root[edx_get_headers] Building initial headers for future requests.
root[_get_initial_token] Getting initial CSRF token.
root[_get_initial_token] Found CSRF token.
root[edx_get_headers] Headers built: {'User-Agent': 'edX-downloader/0.01', 'Accept': 'application/json, text/javascript, */*; q=0.01', 'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8', 'Referer': 'https://courses.edx.org/login_ajax', 'X-Requested-With': 'XMLHttpRequest', 'X-CSRFToken': 'PUsSLjqYvxBtMFO07I7RfYRpxPPZdHE0zWBVoJk4aqqo8AOSciOeEoSTr49FvNeH'}
root[edx_login] Logging into Open edX site: https://courses.edx.org/login_ajax
root[get_courses_info] Extracting course information from dashboard.
root[get_courses_info] Data extracted: ["lotsofcourseswhichidontwanttoshare"]
root[get_available_sections] Extracting sections for :https://courses.edx.org/courses/course-v1:MITx+14.750x+3T2019/course/
root[get_available_sections] Extracted sections: []
root[_display_selections] Downloading Political Economy and Economic Development [course-v1:MITx+14.750x+3T2019/co]
root[_display_sections] Downloading 0 section(s)
root[extract_all_units_in_sequence] Extracting all units information in sequentially.
root[extract_all_units_in_sequence] urls: []
root[parse_units] No downloadable video found.

adizukerman commented 5 years ago

Same issue with https://courses.edx.org/courses/course-v1:MITx+2.830.2x+3T2019/course/

ozhaggis commented 5 years ago

Same issue with multiple courses.

edx_dl version 0.1.10
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Data Science: Machine Learning [course-v1:HarvardX+PH125.8x+2T2019/co]
Downloading 0 section(s)
Extracting all units information in parallel.
No downloadable video found.

EugeneLoy commented 5 years ago

Same issue. Course: https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/

It looks like edx-dl is missing most of the sections of the course. In my example, it sees only 1 section, while edx site displays more than 5 (at the moment):

> edx-dl.py -u <username> --list-sections https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/
edx_dl version 0.1.10
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Fundamentals of Statistics [course-v1:MITx+18.6501x+3T2019/co] has 1 sections so far
 1 - Download Entrance Survey videos

not-lucky commented 5 years ago

Here's mine...

Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Calculus Applied! [course-v1:HarvardX+CalcAPL1x+2T2019/co]
Downloading 3 section(s)
Section 1: Optional Sections (CHOOSE 1 of 3) Optional Sections Section 2: Section 12: Course Wrap Up End of Course Survey Course Feedback Forum (Optional) Section 3: Acknowledgements Course Team and Special Thanks Section 1: What Makes a Good Test Question? Mathematical Models to Measure Knowledge and Improve Learning Section 2: Economic Applications of Calculus: Elasticity and A Tale of Two Cities Section 3: From X-rays to CT scans: Mathematics and Medical Imaging Section 4: What is Middle Income? Thinking about Income Distributions with Statistics and Calculus Section 5: Population Dynamics Part I: the Evolution of Population Models and Section 6: Population Dynamics II: A Biological Puzzle OR How Fishing Affects a Predator-Prey System Section 7: Extinction, Chaos and other Bifurcation Behavior, Section 8: Bifurcation Part II: Outbreak! Budworm Populations and Bifurcations, Section 9: Bifurcation Part III: Species in Competition: Coexistence or Exclusion Section 10: E = mc²: Taylor Approximation and the Energy Equation Final Assessments
Extracting all units information in parallel.
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@944fb6867b354e2cafb41415aae41415'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@2101c542ac614691acc54224d3c314a8'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@5864500159ef40f9839d66d2492fea58'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@13aed97186fd4c7588a5ea1399e096df'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@a53371a01e9c4fd28dcb1a1609614da7'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@ebf2c858d37e418583f839965631108f'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@d4e29c075ff14ad583a3750767faf698'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@6cc97f049d444c4f8470b88ad3fdbc52'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@0f7edf523c55490e8380b6e9a809df33'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@fb7c4d1c1a2649b29e472b2ef86a36ce'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@edb436fadf2c4b74b175b9b5b6334b48'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@1ccb65aca6b34beda14dedfa6bffafbc'
Removed 0 duplicated urls from 0 in total
Output directory: Downloaded

abeckman commented 5 years ago

Same issue with multiple courses.

lubaroli commented 5 years ago

Same issue here, edx-dl only sees the first section.

Here's the log:

root[main] edx_dl version 0.1.10
root[parse_file_formats] file_formats: ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg', 'zip', 'rar', 'gz', 'mp3', 'R', 'Rmd', 'ipynb', 'py']
Password:
root[edx_get_headers] Building initial headers for future requests.
root[_get_initial_token] Getting initial CSRF token.
root[_get_initial_token] Found CSRF token.
root[edx_get_headers] Headers built: {'User-Agent': 'edX-downloader/0.01', 'Accept': 'application/json, text/javascript, */*; q=0.01', 'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8', 'Referer': 'https://courses.edx.org/login_ajax', 'X-Requested-With': 'XMLHttpRequest', 'X-CSRFToken': 'wWr0eKCgnA1uusK8rQvzPJHFK8bXmxn4i1pxyGtnuxsy0MRE8LXYh87mk8DN1eST'}
root[edx_login] Logging into Open edX site: https://courses.edx.org/login_ajax
root[get_courses_info] Extracting course information from dashboard.
root[get_courses_info] Data extracted: [Fundamentals of Statistics: https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/, TOEFL Test Preparation: The Insider’s Guide: https://courses.edx.org/courses/course-v1:ETSx+TOEFLx+3T2017/course/, Minds and Machines: https://courses.edx.org/courses/course-v1:MITx+24.09x+3T2015/course/, Practical Learning Analytics: https://courses.edx.org/courses/course-v1:MichiganX+PLAx+2T2016/course/, Embedded Systems - Shape the World: https://courses.edx.org/courses/course-v1:UTAustinX+UT.6.03x+1T2016/course/, The Science of Everyday Thinking: https://courses.edx.org/courses/course-v1:UQx+Think101x+2T2015/course/, Electronic Interfaces: https://courses.edx.org/courses/course-v1:BerkeleyX+EE40LX+2T2015/course/, Autonomous Navigation for Flying Robots: https://courses.edx.org/courses/TUMx/AUTONAVx/2T2014/course/, Next Generation Infrastructures - Part 2: https://courses.edx.org/courses/DelftX/NGI102x/3T2014/course/, Solar Energy: https://courses.edx.org/courses/DelftX/ET.3034TU/3T2014/course/, Circuits and Electronics: https://courses.edx.org/courses/MITx/6.002_4x/3T2014/course/]
root[get_available_sections] Extracting sections for :https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/
root[get_available_sections] Extracted sections: [<edx_dl.common.Section object at 0x1042f6110>]
root[_display_selections] Downloading Fundamentals of Statistics [course-v1:MITx+18.6501x+3T2019/co]
root[_display_sections] Downloading 1 section(s)
root[_display_sections] Section 1: Entrance Survey
root[_display_sections] 1. Entrance Survey
root[extract_all_units_in_parallel] Extracting all units information in parallel.
root[extract_all_units_in_parallel] urls: ['https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/jump_to/block-v1:MITx+18.6501x+3T2019+type@vertical+block@entrancesurvey-tab1']
root[extract_units] Processing 'https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/jump_to/block-v1:MITx+18.6501x+3T2019+type@vertical+block@entrancesurvey-tab1'
root[main] Removed 0 duplicated urls from 0 in total
root[download] Output directory: Downloaded

wzhuwz commented 5 years ago

Looks like edx-dl is missing most of the sections of the course. My case: https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/course/

Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading FA18: Deterministic Optimization [course-v1:GTx+ISYE6669+2T2018/co]
Downloading 5 section(s)
Section 1: Getting Started Welcome Message Syllabus Getting Help Getting to Know Each Other
Section 2: Discussions and Q&A Discussions and Q&A Forums
Section 3: Proctoring Information - Verified Learners
Section 4: Midterm Exam - Verified Learners
Section 5: Final Exam - Verified Learners
Extracting all units information in parallel.
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@b4e0e428596e4a438b61d9c44a66ff45'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@6e0eef9f7a9b4eed99ea9c1ad8e37b16'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@d827bed0374e46b5a0abe62978b7cca8'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@3247cb48d14b4f1e97bb9dd74d1ec8a2'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@c49832c367cc47be96ba15a3ce5e9d8c'
Removed 0 duplicated urls from 0 in total
Output directory: Downloaded

dorianherle commented 5 years ago

I have the same issue:

edx_dl version 0.1.10
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Introduction to Discrete Choice Models [course-v1:EPFLx+DiscreteChoiceX+3T2017/co]
Downloading 0 section(s)
Extracting all units information in parallel.
No downloadable video found.

mor3dr3ad commented 5 years ago

So, I've dug into the code a bit and I think I found the issue: for some courses, edX has again updated the structure of their website. The issue is with line 397 in /edx_dl/parsing.py:

    sections_soup = soup.find_all('li', class_='outline-item section')

In the new format, the sections have a different class, namely "outline-item section scored".

Should be easily fixed. I'll try to hack something together, but this had better be checked by someone experienced.

mor3dr3ad commented 5 years ago

Alright, quick fix:

replace as follows in /edx_dl/parsing.py:

Line 385: subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable') with subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])

and line 397:

sections_soup = soup.find_all('li', class_='outline-item section') with sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])

This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.
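
For anyone wanting to see why the extra `scored` class matters, here is a minimal stdlib-only sketch. It is not the edx-dl code (which uses BeautifulSoup as shown above); it swaps in `html.parser` and matches on individual class tokens, so any additional class on the `li` is tolerated. The class name `OutlineSectionCounter`, the function `count_sections` and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class OutlineSectionCounter(HTMLParser):
    """Count <li> elements whose class list contains all required tokens."""

    def __init__(self, required=("outline-item", "section")):
        super().__init__()
        self.required = set(required)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        # Split the class attribute into tokens so that extra classes
        # such as "scored" do not prevent a match.
        classes = set((dict(attrs).get("class") or "").split())
        if self.required <= classes:
            self.count += 1


def count_sections(html):
    """Return how many section-like <li> items the markup contains."""
    parser = OutlineSectionCounter()
    parser.feed(html)
    return parser.count


# Hypothetical markup mixing the 'old' and 'new' class strings.
SAMPLE = (
    '<ol>'
    '<li class="outline-item section">Old-style section</li>'
    '<li class="outline-item section scored">New-style section</li>'
    '<li class="vertical outline-item focusable">A subsection</li>'
    '</ol>'
)

print(count_sections(SAMPLE))
```

Token-based matching trades strictness for robustness: it would also accept future variants with other extra classes, which may or may not be what the parser should do.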

not-lucky commented 5 years ago

Thanks a lot. It's working now.

malawadd commented 5 years ago

thank you it works now

malawadd commented 5 years ago

> Alright, quick fix:
>
> replace as follows in /edx_dl/parsing.py:
>
> Line 385: subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable') with subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])
>
> and line 397:
>
> sections_soup = soup.find_all('li', class_='outline-item section') with sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])
>
> This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.

This partially works; it still misses some weeks and modules. I tried it on this course:

https://courses.edx.org/courses/course-v1:CurtinX+MKT1x+1T2019/course/

and the entire module 3 didn't download.

mor3dr3ad commented 5 years ago

@malawadd can you please share error messages/debug info? Do the sections just not download or does it exit with a message?

malawadd commented 5 years ago

@mor3dr3ad

It creates an empty folder but skips all of that module's content, then proceeds to download the following module and all its content. There are no error messages or anything.

mor3dr3ad commented 5 years ago

Just ran the course you mentioned and it seems to be working for me. Will do some more testing this week. In the meantime, maybe download the missing videos manually.

malawadd commented 5 years ago

@mor3dr3ad do you mind telling me more about the testing you plan to run? I would like to try to fix this, but I'm not sure where to start or what exactly I should look for.

mor3dr3ad commented 5 years ago

@malawadd well, for starters you could help by running edx-dl with the --debug flag on the course you mentioned and posting the output.

For me, my fix is working, even with your course. So without being able to reproduce your error I can only assume there is a different issue (maybe using a different version of edx-dl?)

rbrito commented 5 years ago

If something fixes a program, why don't you submit your changes as a pull request to fix things (or get things slightly improved) for other users?

mor3dr3ad commented 5 years ago

Planning on doing exactly that sometime this week. Just a bit busy right now


rbrito commented 5 years ago

Thanks, please do and I can do a round of code review and merge everything. That will be awesome!

maxshatskiy commented 5 years ago

Hello,

> Alright, quick fix:
>
> replace as follows in /edx_dl/parsing.py:
>
> Line 385: subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable') with subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])
>
> and line 397:
>
> sections_soup = soup.find_all('li', class_='outline-item section') with sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])
>
> This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.

This solution works for many courses, but now old courses are not supported: https://courses.edx.org/courses/course-v1:KTHx+DTS02.1x+1T2018/course/

adizukerman commented 5 years ago

For class https://courses.edx.org/courses/course-v1:MITx+2.830.2x+3T2019/course/ it worked partially. Not all videos and attachments were downloaded.

By the way, thank you to everyone who is working on this. This tool is so helpful as a time saver to allow working on classes offline.

WajdiBenSaad commented 5 years ago

> Alright, quick fix:
>
> replace as follows in /edx_dl/parsing.py:
>
> Line 385: subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable') with subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])
>
> and line 397:
>
> sections_soup = soup.find_all('li', class_='outline-item section') with sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])
>
> This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.

This should be integrated into a new release. edX has changed their website structure, and this change breaks all download operations with edx-dl.

antoniosereno commented 4 years ago

Thanks everyone! I'm facing the same issue, and unfortunately the solution provided does not work with this course: https://courses.edx.org/courses/course-v1:EdinburghX+CCSx+3T2019/course/ Any hints?

EugeneLoy commented 4 years ago

Hi. I've put together a few pull requests that fix various issues with the currently released (0.1.10) edx-dl:

  1. #569 - quickfix from #568 that fixes login issues. This is a blocker for the next two, and I'd like someone to review it, since I am not overly familiar with the login process (nor is the original quickfix author).

  2. #570 - this fixes the current issue. I tested it on courses I had problems with before, as well as some of the problematic ones that were reported here. The mentioned courses worked for me, but if someone has problems with other courses, let me know so I can update the fix.

  3. #556 - my earlier fix for broken resource file names.

@rbrito could you or other core contributors please review these PRs and release a new version with these fixes included? The currently released version has been unusable for some time now, and it would be great to release fixes for these breaking issues as soon as possible.

Meanwhile, if someone needs a working version or is willing to test these fixes, you can access a cumulative fix with all of the above included here: https://github.com/EugeneLoy/edx-dl/tree/cummulative

EugeneLoy commented 4 years ago

@malawadd I've checked the course you are having problems with, and it looks like some of the videos are no longer available:

[download] https://www.youtube.com/watch?v=N9SFeRNAfEA => Downloaded\Digital_Branding_and_Engagement\02-Module_1-_The_Digital_Consumer\02-%(title)s-%(id)s.%(ext)s
Downloading video with URL https://www.youtube.com/watch?v=N9SFeRNAfEA from YouTube.
[youtube] N9SFeRNAfEA: Downloading webpage
[youtube] N9SFeRNAfEA: Downloading video info webpage
WARNING: Unable to extract video title
WARNING: unable to extract description; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
ERROR: This video is no longer available because the YouTube account associated with this video has been terminated.
Sorry about that.

It is likely that your specific problem was caused by the deletion of the video from YouTube itself, not by a bug in edx-dl.

antoniosereno commented 4 years ago

Hi @EugeneLoy , thank you for your help! May I ask if you were able to download this course? https://courses.edx.org/courses/course-v1:EdinburghX+CCSx+3T2019/course/ I'm having trouble with it but not with others

EugeneLoy commented 4 years ago

@antoniosereno yes, I've been able to download that course.

antoniosereno commented 4 years ago

OK, I've downloaded the edx-dl-cummulative, did everything you suggested, and now it gives me an HTTP Error 400: Bad Request.

Yesterday I was able to access the course list; now I'm not able to anymore.

Is there anything I'm missing?

EugeneLoy commented 4 years ago

@antoniosereno are you sure you are running the code from the cummulative branch of the repo and not the one installed globally on your system?

The error you are getting looks like the one that should be fixed by #569 .

One way to run code from repo is to cd into repo root and point python to .py file directly, like this:

python edx-dl.py -u <user> <course_url>

If this doesn't help, please post the full debug output so I can figure out what went wrong.
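
A quick way to check which copy of `edx_dl` Python would actually import is sketched below. This is only a diagnostic sketch using the standard library; the helper names `locate_module` and `is_installed_copy` are made up here and are not part of edx-dl. Run it from the same directory you launch edx-dl.py from.

```python
import importlib.util
import os


def locate_module(name):
    """Return the file path Python would import `name` from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def is_installed_copy(path):
    """True if the path lies inside a site-packages or dist-packages directory."""
    # Normalise Windows and POSIX separators so both styles of path work.
    parts = os.path.normpath(path).replace("\\", "/").lower().split("/")
    return "site-packages" in parts or "dist-packages" in parts


origin = locate_module("edx_dl")
if origin is None:
    print("edx_dl is not importable from this directory")
elif is_installed_copy(origin):
    print("Using the installed copy:", origin)
else:
    print("Using a local checkout (probably the repo):", origin)
```

If it reports the installed copy, the fixes from the branch are not the code being executed.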

naefl commented 4 years ago

Hi @EugeneLoy,

Doesn't work on my end either.

From your fork root dir:

In:

python edx-dl.py -u <name>@gmail.com https://courses.edx.org/courses/course-v1:DavidsonX+D001x+3T2018/course/

Out:

rses.edx.org/courses/course-v1:DavidsonX+D001x+3T2018/course/ --debug
root[main] edx_dl version 0.1.10
root[parse_file_formats] file_formats: ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg', 'zip', 'rar', 'gz', 'mp3', 'R', 'Rmd', 'ipynb', 'py']
Password:
root[edx_get_headers] Building initial headers for future requests.
root[_get_initial_token] Getting initial CSRF token.
Traceback (most recent call last):
  File "edx-dl.py", line 6, in <module>
    edx_dl.main()
  File "/root/workspace/edx-dl/edx_dl/edx_dl.py", line 1000, in main
    headers = edx_get_headers()
  File "/root/workspace/edx-dl/edx_dl/edx_dl.py", line 425, in edx_get_headers
    'X-CSRFToken': _get_initial_token(EDX_HOMEPAGE),
  File "/root/workspace/edx-dl/edx_dl/edx_dl.py", line 167, in _get_initial_token
    opener.open(url)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

EugeneLoy commented 4 years ago

@naefl @antoniosereno I think I know what the problem is. However, I'll need a bit more cooperation from you to make sure, since I cannot reproduce this in my environment.

I've added a commit with a test fix and some debug output to the cummulative branch. Please grab it and let me know if this works for you now.

If this doesn't fix the issue, please post the full debug output as before, as well as the output of the following:

curl -v https://courses.edx.org/user_api/v1/account/login_session/
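
If curl is not available (for example on some Windows setups), a rough Python stand-in is sketched below. It is an assumption-laden sketch, not a replacement for the `curl -v` output requested above, which also shows request headers and TLS details. The helper names `describe_response` and `probe` are made up for illustration; only the URL comes from the command above.

```python
import urllib.error
import urllib.request

LOGIN_SESSION_URL = "https://courses.edx.org/user_api/v1/account/login_session/"


def describe_response(status, reason, headers):
    """Format a status line and response headers as plain text."""
    lines = ["{} {}".format(status, reason)]
    lines.extend("{}: {}".format(key, value) for key, value in headers.items())
    return "\n".join(lines)


def probe(url):
    """Fetch `url` and return its status line and headers, even on 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return describe_response(resp.status, resp.reason, resp.headers)
    except urllib.error.HTTPError as err:
        # Error responses still carry a status and headers worth posting.
        return describe_response(err.code, err.reason, err.headers)


# Uncomment to perform the real network request:
# print(probe(LOGIN_SESSION_URL))
```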
antoniosereno commented 4 years ago

Thank you Eugene.. This is my output when I try to list courses:

(base) C:\edx-dl-cummulative\edx-dl-cummulative>edx-dl -u antoniosereno29@gmail.com --list-courses
edx_dl version 0.1.10
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Traceback (most recent call last):
  File "c:\users\anton\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\anton\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\anton\Anaconda3\Scripts\edx-dl.exe\__main__.py", line 9, in <module>
  File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 1000, in main
    headers = edx_get_headers()
  File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 425, in edx_get_headers
    'X-CSRFToken': _get_initial_token(EDX_HOMEPAGE),
  File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 167, in _get_initial_token
    opener.open(url)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

and this one is for the command you asked us to run:

(base) C:\edx-dl-cummulative\edx-dl-cummulative>curl -v https://courses.edx.org/user_api/v1/account/login_session/

EugeneLoy commented 4 years ago

@antoniosereno Thanks, but from your debug output I can say for sure that the edx-dl installed in your environment is being used rather than the one from the repo, as indicated by this part of the stack trace:

File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 167, in _get_initial_token
    opener.open(url)

Please point your python directly at the edx-dl.py from the repo to avoid using the version that is installed on your system.

Looking at your post, command should look something like this:

C:\edx-dl-cummulative\edx-dl-cummulative>python edx-dl.py -u antoniosereno29@gmail.com --list-courses

adizukerman commented 4 years ago

@EugeneLoy , works great with https://courses.edx.org/courses/course-v1:MITx+2.830.2x+3T2019/course/ , thank you so much for the time and effort! I hope it gets integrated into the master build soon.

antoniosereno commented 4 years ago

It worked! I was able to download all the videos in the course! Thank you! May I ask if there's a command to download not only media (videos and PDFs) but also the written content?

EugeneLoy commented 4 years ago

As far as I know, if a file is "attached" to a course page, it will be treated as a resource by edx-dl and will be downloaded. At least this has been my experience so far.

Sometimes, however, you have extra content that is present on the page inline (like errata, tables, extra recitations, text explanations, etc.). As far as I understand, this is what you are interested in.

Now, it just so happens that lately I've been working on a tool that saves this kind of content :)

It is also helpful if you want to save exercises and homework (with explanations), or, any other type of content that is displayed on the course pages.

This tool is meant to complement edx-dl and is called edx-archive and can be found here: https://github.com/EugeneLoy/edx-archive

I only released it recently, so if you guys check it out that would be great!

antoniosereno commented 4 years ago

Wow, I'll take a look at it! I was initially thinking of doing it manually, but it would be a lot of work! Thank you Eugene!

naefl commented 4 years ago

@EugeneLoy that worked, thanks for troubleshooting!

balta2ar commented 4 years ago

@EugeneLoy from your tool's page

-c, --concurrency number of pages to save in parallel (default: 4)

I don't know what's the current state of their implementation on the backend now, but my impression was that hammering edx servers is generally not a good idea. FWIW, couple of years ago they blocked me by IP for several months after me flooding their servers with requests (debugging this edx-dl, by the way). It's not that the ban could not be surmounted, but the message was clear. So if you ask me, it's more of a courtesy to not put extra pressure on them by default. If you're still not convinced, please take your time to read this thread: https://github.com/coursera-dl/edx-dl/issues/377

EugeneLoy commented 4 years ago

@balta2ar Thanks, I will take the time to read through #377. However, the motivation behind adding concurrency to the tool is not to speed things up at the expense of the edX servers but to shave off some of the time wasted on page rendering.

The tool takes a snapshot of the page once it has fully rendered (including math processing), and since edX pages can be pretty bloated (I saw pages taking more than a minute to render), this leads to a lot of time being wasted waiting for the render (with no network activity).

The actual workload in terms of average request rate is not high and should not cause any issues with default settings. In fact, I have used a much higher concurrency factor, and I can say that memory is much more likely to be the bottleneck than request rate.
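
The idea that a fixed concurrency limit overlaps render waits without letting the number of in-flight pages grow unbounded can be sketched with a bounded worker pool. This is a Python illustration only; edx-archive is a separate tool installed via npm, and `save_pages`, `ConcurrencyProbe` and the simulated render delay are hypothetical names invented for this sketch.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def save_pages(pages, save_one, concurrency=4):
    """Run save_one over pages with at most `concurrency` pages in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() preserves input order in its results.
        return list(pool.map(save_one, pages))


class ConcurrencyProbe:
    """Stand-in for a slow page render that records peak parallelism."""

    def __init__(self, render_seconds=0.05):
        self.render_seconds = render_seconds
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, page):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(self.render_seconds)  # simulated render wait, no network
        with self._lock:
            self._active -= 1
        return "saved:" + page


probe = ConcurrencyProbe()
results = save_pages(["page-%d" % i for i in range(10)], probe, concurrency=4)
print(len(results), "pages saved, peak parallelism:", probe.peak)
```

While workers sit in the simulated render wait, no requests are issued, which is the situation described above; the pool size caps how many pages can be open at once.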

antoniosereno commented 4 years ago

Sorry for the late answer. Can you please mention the entire procedure to run the edx-archive-master? I'm not able to install it, anaconda prompt says that npm is not recognised as an internal or external command

EugeneLoy commented 4 years ago

@antoniosereno Hi.

npm is "node package manager". It is distributed along with node.

If I am not mistaken, you can get node through conda by installing nodejs package. Otherwise, you can get it from here.

Once you get npm on your system, install edx-archive:

npm install edx-archive -g

I'll update readme to clear this npm part shortly.

antoniosereno commented 4 years ago

it works perfectly @EugeneLoy ! Thanks a lot, you saved me a big amount of time!

gaber86 commented 4 years ago

Still getting empty folders; not working with https://courses.edx.org/courses/course-v1:UCSanDiegoX+DSE230x+3T2019a/course/

Navid-Alipour-96 commented 4 years ago

I get empty folders. I tried the code above, but it doesn't work: https://courses.edx.org/courses/course-v1:CurtinX+IOT4x+3T2019/course/

ghost commented 4 years ago

Is there a way to download a particular video and not the whole course?