coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

Empty folders still a problem with multiple courses #591

Closed abeckman closed 4 years ago

abeckman commented 4 years ago

Subject of the issue

For several courses, edx-dl either reports no downloadable video or downloads video only from sections that contain no other video. (Some landing pages mix video and text; others have only a single video.)

Your environment

Steps to reproduce

See below

Expected behaviour

edx-dl should download many videos, or print messages about skipping the ones already downloaded. (For both courses I'm attempting to download, I had already got content for the first few weeks.)

Actual behaviour

One course gets all empty folders.

$ edx-dl -u ** -p ** https://courses.edx.org/courses/course-v1:MITx+7.28.1x+1T2020/course/ --ignore-error
edx_dl version 0.1.12
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Molecular Biology - Part 1: DNA Replication and Repair [course-v1:MITx+7.28.1x+1T2020/co]
Downloading 0 section(s)
Extracting all units information in parallel.
No downloadable video found.

In the course below, it finds one previously downloaded video in week 0 and downloads one from week 6. I had previously downloaded through week 2, so the videos for weeks 3-5 are missing. In the other course I tested with, I had likewise already downloaded week 1, but edx-dl reported no videos at all for week 1 or any later week.

edx-dl -u **** -p ** https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/course/ --ignore-error
edx_dl version 0.1.12
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Hello (Real) World with ROS – Robot Operating System [course-v1:DelftX+ROS1x+1T2020/co]
Downloading 2 section(s)
Section 1: Welcome
    Welcome
    Pre-survey
    Course conventions
    Course Setup
Section 2: Course wrap-up
    Course wrap-up
    Post-survey
    Acknowledgements
Extracting all units information in parallel.
Processing 'https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/jump_to/block-v1:DelftX+ROS1x+1T2020+type@sequential+block@659887fafb8847359a9e9287825cbd7a'
Processing 'https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/jump_to/block-v1:DelftX+ROS1x+1T2020+type@sequential+block@49488c09c7274ee2a3c2ec3cb3eac1c7'
Processing 'https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/jump_to/block-v1:DelftX+ROS1x+1T2020+type@sequential+block@9fb4681af77d4b229a2470bed01c7f6c'
Processing 'https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/jump_to/block-v1:DelftX+ROS1x+1T2020+type@sequential+block@1e69cfa733964152b8d34232dedd7d2d'
Processing 'https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/jump_to/block-v1:DelftX+ROS1x+1T2020+type@sequential+block@304850235f024aa6a383239688cd190f'
Processing 'https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/jump_to/block-v1:DelftX+ROS1x+1T2020+type@sequential+block@3591fd3227c1473b9ed6f721b26e9340'
Processing 'https://courses.edx.org/courses/course-v1:DelftX+ROS1x+1T2020/jump_to/block-v1:DelftX+ROS1x+1T2020+type@sequential+block@fba3a9c54c59409f8a38cac9c7c4f9ba'
Removed 0 duplicated urls from 8 in total
Output directory: Downloaded
[download] https://youtube.com/watch?v=RoIRFnDLj3c => Downloaded/Hello_Real_World_with_ROS__Robot_Operating_System/01-Welcome/01-%(title)s-%(id)s.%(ext)s
Downloading video with URL https://youtube.com/watch?v=RoIRFnDLj3c from YouTube.
[youtube] RoIRFnDLj3c: Downloading webpage
[youtube] RoIRFnDLj3c: Downloading video info webpage
[youtube] RoIRFnDLj3c: Downloading MPD manifest
[download] Downloaded/Hello_Real_World_with_ROS__Robot_Operating_System/01-Welcome/01-ROS1x_2020_Week_0_Overview_course-video-RoIRFnDLj3c.mp4 has already been downloaded
[download] 100% of 18.82MiB
[download] https://youtube.com/watch?v=8tG0ZEIMgvc => Downloaded/Hello_Real_World_with_ROS__Robot_Operating_System/02-Course_wrap-up/01-%(title)s-%(id)s.%(ext)s
Downloading video with URL https://youtube.com/watch?v=8tG0ZEIMgvc from YouTube.
[youtube] 8tG0ZEIMgvc: Downloading webpage
[youtube] 8tG0ZEIMgvc: Downloading video info webpage
[youtube] 8tG0ZEIMgvc: Downloading MPD manifest
[download] Downloaded/Hello_Real_World_with_ROS__Robot_Operating_System/02-Course_wrap-up/01-ROS1x_2018_Week_6_Acknowledgements-video-8tG0ZEIMgvc.mp4 has already been downloaded
[download] 100% of 22.93MiB

ichit commented 4 years ago

Hello, I have done everything suggested on this platform and nothing works. Below is what I always get. I have tried everything from Python 2.7 to Python 3.8; nothing works.

(python27) PS C:\Users*\edx-dl-0.1.12> python edx-dl.py edx-dl -u *** --list-courses
edx_dl version 0.1.12
Password:*
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
You can access 45 courses
1 - IELTS Academic Test Preparation [course-v1:UQx+IELTSx+1T2020/co]
    https://courses.edx.org/courses/course-v1:UQx+IELTSx+1T2020/course/
2 - Probability - The Science of Uncertainty and Data [course-v1:MITx+6.431x+1T2020/co]
    https://courses.edx.org/courses/course-v1:MITx+6.431x+1T2020/course/
3 - Data Science: Probability [course-v1:HarvardX+PH125.3x+1T2020/co]
    https://courses.edx.org/courses/course-v1:HarvardX+PH125.3x+1T2020/course/
4 - DemoX [course-v1:edX+DemoX.1+2T2019/co]
    https://courses.edx.org/courses/course-v1:edX+DemoX.1+2T2019/course/
5 - Solar Energy: Photovoltaic (PV) Energy Conversion [course-v1:DelftX+PV1x+1T2020/co]
    https://courses.edx.org/courses/course-v1:DelftX+PV1x+1T2020/course/
6 - Signals and Systems, Part 1 [course-v1:IITBombayX+EE210.1x+1T2018a/co]
    https://courses.edx.org/courses/course-v1:IITBombayX+EE210.1x+1T2018a/course/
7 - Signals and Systems, Part 2 [course-v1:IITBombayX+EE210.2x+1T2018/co]
    https://courses.edx.org/courses/course-v1:IITBombayX+EE210.2x+1T2018/course/
8 - Principle of Semiconductor Devices Part I: Semiconductors, PN Junctions and Bipolar Junction Transistors [course-v1:HKUSTx+ELEC3500.1x+1T2020/co]
    https://courses.edx.org/courses/course-v1:HKUSTx+ELEC3500.1x+1T2020/course/
9 - Principles of Electronic Biosensors [course-v1:PurdueX+nano535x+2016_T1/co]
    https://courses.edx.org/courses/course-v1:PurdueX+nano535x+2016_T1/course/
10 - Introduction to Urban Geo-Informatics [course-v1:HKPolyUx+LSGI1001x+1T2018/co]
    https://courses.edx.org/courses/course-v1:HKPolyUx+LSGI1001x+1T2018/course/
11 - MATLAB and Octave for Beginners [course-v1:EPFLx+MatlabeOctaveBeginnersX+1T2017/co]
    https://courses.edx.org/courses/course-v1:EPFLx+MatlabeOctaveBeginnersX+1T2017/course/
12 - Discrete Time Signals and Systems, Part 1: Time Domain [RiceX/ELEC301x/2015Q3/co]
    https://courses.edx.org/courses/RiceX/ELEC301x/2015Q3/course/
13 - Discrete Time Signals and Systems, Part 2: Frequency Domain [RiceX/ELEC301.2x/2015Q3/co]
    https://courses.edx.org/courses/RiceX/ELEC301.2x/2015Q3/course/
14 - Discrete-Time Signal Processing [course-v1:MITx+6.341x_2+2T2016/co]
    https://courses.edx.org/courses/course-v1:MITx+6.341x_2+2T2016/course/
15 - C Programming: Using Linux Tools and Libraries [course-v1:Dartmouth_IMTx+DART.IMT.C.07+2T2018/co]
    https://courses.edx.org/courses/course-v1:Dartmouth_IMTx+DART.IMT.C.07+2T2018/course/
16 - Introduction to Linux [course-v1:LinuxFoundationX+LFS101x+1T2020/co]
    https://courses.edx.org/courses/course-v1:LinuxFoundationX+LFS101x+1T2020/course/
17 - Introduction to Probability [course-v1:HarvardX+STAT110x+1T2020/co]
    https://courses.edx.org/courses/course-v1:HarvardX+STAT110x+1T2020/course/
18 - CS50's Introduction to Computer Science [course-v1:HarvardX+CS50+X/co]
    https://courses.edx.org/courses/course-v1:HarvardX+CS50+X/course/
19 - Python for Data Science [course-v1:UCSanDiegoX+DSE200x+3T2019a/co]
    https://courses.edx.org/courses/course-v1:UCSanDiegoX+DSE200x+3T2019a/course/
20 - Introduction to Data Science [course-v1:Microsoft+DAT101x+1T2020/co]
    https://courses.edx.org/courses/course-v1:Microsoft+DAT101x+1T2020/course/
21 - Introduction to Data Science [course-v1:IBM+DS0101EN+1T2020/co]
    https://courses.edx.org/courses/course-v1:IBM+DS0101EN+1T2020/course/
22 - Data Science: R Basics [course-v1:HarvardX+PH125.1x+1T2020/co]
    https://courses.edx.org/courses/course-v1:HarvardX+PH125.1x+1T2020/course/
23 - Applications of Quantum Mechanics [course-v1:MITx+8.06x+1T2019/co]
    https://courses.edx.org/courses/course-v1:MITx+8.06x+1T2019/course/
24 - Quantum Information Science I, Part 3 [course-v1:MITx+8.370.3x+1T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.370.3x+1T2018/course/
25 - Quantum Information Science I, Part 1 [course-v1:MITx+8.370.1x+1T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.370.1x+1T2018/course/
26 - Quantum Information Science I, Part 2 [course-v1:MITx+8.370.2x+1T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.370.2x+1T2018/course/
27 - Quantum Information Science II, Part 1 - Quantum states, noise and error correction [course-v1:MITx+8.371.1x+2T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.371.1x+2T2018/course/
28 - Quantum Mechanics: Quantum physics in 1D potentials [course-v1:MITx+8.04.2x+3T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.04.2x+3T2018/course/
29 - Quantum Information Science II, Part 2 - Efficient Quantum Computing - fault tolerance and complexity [course-v1:MITx+8.371.2x+2T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.371.2x+2T2018/course/
30 - Calculus 1A: Differentiation [course-v1:MITx+18.01.1x+2T2019/co]
    https://courses.edx.org/courses/course-v1:MITx+18.01.1x+2T2019/course/
31 - Introduction to Differential Equations [course-v1:MITx+18.031x+2T2019/co]
    https://courses.edx.org/courses/course-v1:MITx+18.031x+2T2019/course/
32 - Mechanics: Kinematics and Dynamics [course-v1:MITx+8.01.1x+3T2019/co]
    https://courses.edx.org/courses/course-v1:MITx+8.01.1x+3T2019/course/
33 - Mechanics: Momentum and Energy [course-v1:MITx+8.01.2x+3T2019a/co]
    https://courses.edx.org/courses/course-v1:MITx+8.01.2x+3T2019a/course/
34 - Mastering Quantum Mechanics Part 3: Entanglement and Angular Momentum [course-v1:MITx+8.05.3x+2T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.05.3x+2T2018/course/
35 - Mastering Quantum Mechanics Part 1: Wave Mechanics [course-v1:MITx+8.05.1x+1T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.05.1x+1T2018/course/
36 - Mastering Quantum Mechanics Part 2: Quantum Dynamics [course-v1:MITx+8.05.2x+1T2018/co]
    https://courses.edx.org/courses/course-v1:MITx+8.05.2x+1T2018/course/
37 - Calculus 1B: Integration [course-v1:MITx+18.01.2x+3T2019/co]
    https://courses.edx.org/courses/course-v1:MITx+18.01.2x+3T2019/course/
38 - Differential Equations: 2x2 Systems [course-v1:MITx+18.032x+1T2020/co]
    https://courses.edx.org/courses/course-v1:MITx+18.032x+1T2020/course/
39 - Machine Learning with Python-From Linear Models to Deep Learning [course-v1:MITx+6.86x+1T2020/co]
    https://courses.edx.org/courses/course-v1:MITx+6.86x+1T2020/course/
40 - Computing in Python III: Data Structures [course-v1:GTx+CS1301xIII+3T2019/co]
    https://courses.edx.org/courses/course-v1:GTx+CS1301xIII+3T2019/course/
41 - Computing in Python IV: Objects & Algorithms [course-v1:GTx+CS1301xIV+3T2019/co]
    https://courses.edx.org/courses/course-v1:GTx+CS1301xIV+3T2019/course/
42 - Programming for Everybody (Getting Started with Python) [course-v1:MichiganX+py4e101x+3T2019/co]
    https://courses.edx.org/courses/course-v1:MichiganX+py4e101x+3T2019/course/
43 - Python Data Structures [course-v1:MichiganX+py4e102x+3T2019/co]
    https://courses.edx.org/courses/course-v1:MichiganX+py4e102x+3T2019/course/
44 - Using Python for Research [course-v1:HarvardX+PH526x+1T2020/co]
    https://courses.edx.org/courses/course-v1:HarvardX+PH526x+1T2020/course/
45 - Introduction to Computer Science and Programming Using Python [course-v1:MITx+6.00.1x+1T2020/co]
    https://courses.edx.org/courses/course-v1:MITx+6.00.1x+1T2020/course/

(python27) PS C:\Users***\edx-dl-0.1.12> python edx-dl.py edx-dl -u ***** -p **** https://courses.edx.org/courses/RiceX/ELEC301x_/2015Q3/course/
edx_dl version 0.1.12
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Error, cannot login: HTTP Error 400: Bad Request
Wrong Email or Password.
** : The term '**' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:55

* : The term '' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:1 char:65

(python27) PS C:\Users*****\edx-dl-0.1.12>

ichit commented 4 years ago

I beg anyone who can to help me download the following courses, which I need for my thesis; I am losing time. These are the courses I most desperately need:

12 - Discrete Time Signals and Systems, Part 1: Time Domain [RiceX/ELEC301x/2015Q3/co] https://courses.edx.org/courses/RiceX/ELEC301x/2015Q3/course/

13 - Discrete Time Signals and Systems, Part 2: Frequency Domain [RiceX/ELEC301.2x/2015Q3/co] https://courses.edx.org/courses/RiceX/ELEC301.2x/2015Q3/course/

Thanks in advance.

Oshibuki commented 4 years ago

In PowerShell, you should put quotes around your course URL, like this:

python3 edx-dl.py -u your-account "your course url"

In fact, I don't recommend using PowerShell at all, because it doesn't handle special symbols in unquoted strings well. Use cmd or bash instead.

Also, please edit your first comment or change your password quickly. The error message "The term '**' is not recognized as the name of a cmdlet, function, script file, or operable program" contains your password.
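One way to sidestep shell quoting entirely is to launch the script from Python with an argument list, so that no shell ever parses the URL. The following is only a sketch under stated assumptions: the helper name `build_edx_dl_command` is hypothetical, the script is assumed to be `edx-dl.py` in the current directory, and only the `-u` option and the positional course URL shown in this thread are used. The password is left out on purpose; in the `--list-courses` log above, edx-dl asked for it at a `Password:` prompt when `-p` was not given.

```python
import subprocess
import sys


def build_edx_dl_command(username, course_url):
    """Build an argv list for edx-dl.py (hypothetical helper, not part of edx-dl).

    Each list element reaches the script as exactly one argument, so
    characters such as '+', '&' or ';' are never interpreted by a shell.
    """
    return [sys.executable, "edx-dl.py", "-u", username, course_url]


def run_edx_dl(username, course_url):
    """Run edx-dl.py without a shell; edx-dl then prompts for the password."""
    return subprocess.run(build_edx_dl_command(username, course_url), check=False)


if __name__ == "__main__":
    # Only print the command here; call run_edx_dl(...) to actually start it.
    print(build_edx_dl_command(
        "your-account",
        "https://courses.edx.org/courses/RiceX/ELEC301x/2015Q3/course/",
    ))
```

Because the password is typed at the prompt, it does not end up in the shell history or in a pasted error message.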

Oshibuki commented 4 years ago

The issue is that parsing.py uses the wrong class selector:

sections_soup = soup.find_all('li', class_=['outline-item section'])

As a result it only finds some of the sections: any intermediate section whose class is "outline-item section scored" is ignored.

I have fixed it in this branch: https://github.com/tanjiarui15/edx-dl/tree/fix-parsing-for-edx-multiple-sections
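The selector behaviour described above can be reproduced on a small, made-up outline. This is a minimal sketch, not the code from the linked branch: the HTML fragment and the variable names are invented for illustration, and it uses the string form of `class_`, which Beautiful Soup documents as matching the exact value of the `class` attribute (the line quoted from parsing.py passes a one-element list instead).

```python
from bs4 import BeautifulSoup

# Made-up course outline: graded weeks carry an extra "scored" class.
html = """
<ol>
  <li class="outline-item section">Week 0</li>
  <li class="outline-item section scored">Week 1</li>
  <li class="outline-item section scored">Week 2</li>
  <li class="outline-item section">Wrap-up</li>
  <li class="outline-item subsection">Not a section</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only <li> elements whose class attribute is exactly
# "outline-item section" are returned, so the "scored" weeks are skipped.
exact = soup.find_all("li", class_="outline-item section")
exact_titles = [li.get_text(strip=True) for li in exact]

# Token-based match: keep every <li> that has both class tokens,
# regardless of any extra classes such as "scored".
wanted = {"outline-item", "section"}
all_sections = [
    li for li in soup.find_all("li")
    if wanted <= set(li.get("class", []))
]
all_titles = [li.get_text(strip=True) for li in all_sections]

print(exact_titles)  # ['Week 0', 'Wrap-up']
print(all_titles)    # ['Week 0', 'Week 1', 'Week 2', 'Wrap-up']
```

The first result mirrors the symptom in the original report, where only the first and last sections of a course were found.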

ichit commented 4 years ago

thanks

abeckman commented 4 years ago

@tanjiarui15 - I've tested your changes to parsing.py against two courses that weren't downloading all or most videos and it is now working. Very much appreciate your efforts!