coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

Only folder structure without videos #649

Open carlosvega opened 4 years ago

carlosvega commented 4 years ago

Subject of the issue

edx-dl creates the folder structure but does not download any video

Your environment

Steps to reproduce

edx-dl -u https://courses.edx.org/courses/course-v1:KTHx+PHSC01.1x+1T2020/course/

Expected behaviour

I would expect it to download all videos

Actual behaviour

Is creating folder structure only

diptomondal007 commented 4 years ago

same here

0n0n0m0uz commented 4 years ago

edX recently changed the structure of the website, and this package isn't being maintained as it was before. I think it's going to be up to one of us to fix it. I'm not sure whether the maintainers are still devoting time to this package.

RJFeddeler commented 4 years ago

I started playing with the code yesterday to see if I could get it to work. I haven't used Python much, so my code isn't very pretty, but it's just about working. The videos and subtitles work; I just need to get PDFs and other files downloading. I'm not sure if I can advertise a fork here, but when it's done I'll upload it. Let me know if it's OK to post here.

carlosvega commented 4 years ago

I noticed that some videos are only loaded afterwards via JS introducing an iframe. How do you work around that?

carlosvega commented 4 years ago

Meanwhile, since I couldn't use this tool, I created my own chrome extension for that. You can find it here. https://github.com/carlosvega/edx-video-extension

RJFeddeler commented 4 years ago

I noticed that some videos are only loaded afterwards via JS introducing an iframe. How do you work around that?

There is an API call now that gets unit IDs and other info. The section code stays the same, but to get the units you make an API call for each section. It returns the ID of each unit, and then you use the prefix https://courses.edx.org/xblock/ to get what is loaded in the iframe. I'm testing my final code now. When I upload it you can take a look.
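The flow described above can be sketched roughly as follows. This is only an illustration, not the code from the fork: the shape of the API response (a `blocks` mapping whose entries carry a `type` field) and both helper names are assumptions. Only the https://courses.edx.org/xblock/ prefix comes from the comment itself.

```python
from typing import Dict, List

XBLOCK_PREFIX = "https://courses.edx.org/xblock/"  # prefix mentioned in the thread


def unit_ids_from_response(response: Dict) -> List[str]:
    """Collect the IDs of 'vertical' (unit) blocks from a decoded JSON response.

    ASSUMPTION: the response looks like {"blocks": {block_id: {"type": ...}}}.
    The real API payload may be shaped differently.
    """
    blocks = response.get("blocks", {})
    return [block_id for block_id, info in blocks.items()
            if info.get("type") == "vertical"]


def xblock_url(unit_id: str) -> str:
    """Build the URL whose content the course page loads inside its iframe."""
    return XBLOCK_PREFIX + unit_id


# Hypothetical sample data, for illustration only.
sample = {
    "blocks": {
        "block-v1:Org+Course+Run+type@sequential+block@aaa": {"type": "sequential"},
        "block-v1:Org+Course+Run+type@vertical+block@bbb": {"type": "vertical"},
    }
}
urls = [xblock_url(unit_id) for unit_id in unit_ids_from_response(sample)]
```

Each resulting URL would then be fetched with the logged-in session's headers and parsed for videos.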

carlosvega commented 4 years ago

Great, I think I used a similar approach for the chrome extension.

jturner421 commented 4 years ago

I'd love to see what you've done. I'm working with the code myself and got to the point of correctly identifying the urls for all subsections. Then I ran into this regex on line #92 of parsing.py:

re_units = re.compile('(<div?[^>]id="seq_contents_\d+".*?>.*?<\/div>)', re.DOTALL)

As far as I can tell, this method and its associated regex are what is causing the script to fail to identify units.

Using the following URL as an example: https://courses.edx.org/courses/course-v1:BerkeleyX+Data8.1x+2T2020/jump_to/block-v1:BerkeleyX+Data8.1x+2T2020+type@sequential+block@851eafb36585493aa5ce5c54f8d56d4a

which part do you append to https://courses.edx.org/xblock?

carlosvega commented 4 years ago

The advantage of the extension is that it can wait for JavaScript to load. The iframe src pointing to https://courses.edx.org/xblock/block... is created dynamically via JS through a very convoluted function. The server could even generate a dynamic JS file. There is one file, called something like https://learning.edx.org/vendors~app … .js, that has the function that initialises the video, or some JS that loads the iframe.

I think it won't be possible to build a successful scraper without JS rendering.

In my case I wait for the page to load, then redirect to the iframe src, then I take $('.video.is-initialized').attr('data-metadata') and get the video URL. I can't even parse the iframe content, since they use a different domain for the iframe and the website. They really went far to avoid any scraping.

From https://learning.edx.org/course/course... they change to https://courses.edx.org/xblock/block... but an ID is added, and that ID is what's dynamically recalculated through a very obfuscated process.
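For anyone who wants to do the data-metadata step outside the browser, here is a minimal Python sketch of reading that attribute from already-fetched xblock HTML. It does not solve the JavaScript problem described above; it assumes you already have the HTML of the iframe page. The `sources` key in the sample is an assumption, not a documented field, and the sample markup is made up.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class VideoMetadataParser(HTMLParser):
    """Grab the data-metadata attribute of the first element with class 'video'."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: Optional[dict] = None

    def handle_starttag(self, tag, attrs):
        if self.metadata is not None:
            return  # keep only the first match
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        raw = attr_map.get("data-metadata")
        if "video" in classes and raw:
            # HTMLParser has already unescaped entities such as &quot;
            self.metadata = json.loads(raw)


def extract_video_metadata(html: str) -> Optional[dict]:
    """Return the decoded data-metadata JSON, or None if no video element is found."""
    parser = VideoMetadataParser()
    parser.feed(html)
    return parser.metadata


# Hypothetical fragment; the "sources" key is an assumption for illustration.
sample_html = (
    '<div class="video is-initialized" '
    'data-metadata="{&quot;sources&quot;: [&quot;https://example.com/v.mp4&quot;]}">'
    '</div>'
)
metadata = extract_video_metadata(sample_html)
```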

RJFeddeler commented 4 years ago

I published my code, I modified edx_dl.py and parsing.py

https://github.com/RJFeddeler/edx-dl/

I decided to switch the way youtube-dl is used to the embedded method so I'm still playing around with the settings for that. It doesn't show progress or anything and downloads the best quality and muxes the audio/video together which is slower (ffmpeg is required for that, I forget if it defaults to the normal video file if not installed). I did that because I've been getting a lot of 500 errors when trying to download the default video+audio.
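For readers unfamiliar with the "embedded method", the general idea is to call youtube-dl as a library instead of spawning it as a subprocess. The sketch below is not the code from the fork; the helper names and defaults are hypothetical. The option keys (`outtmpl`, `format`, `ignoreerrors`, `quiet`, `progress_hooks`) are existing youtube-dl options, and youtube-dl itself is only imported when a download is actually started.

```python
from typing import Any, Callable, Dict, List, Optional


def build_ydl_options(output_template: str,
                      video_format: str = "bestvideo+bestaudio/best",
                      ignore_errors: bool = False,
                      hooks: Optional[List[Callable[[Dict[str, Any]], None]]] = None
                      ) -> Dict[str, Any]:
    """Return an options dict suitable for youtube_dl.YoutubeDL (hypothetical helper)."""
    return {
        "outtmpl": output_template,     # output file name template
        "format": video_format,         # "bestvideo+bestaudio" needs ffmpeg to mux
        "ignoreerrors": ignore_errors,  # keep going when one video fails
        "quiet": True,                  # suppress youtube-dl's own console output
        "progress_hooks": hooks or [],  # callbacks receiving status dicts
    }


def download(url: str, options: Dict[str, Any]) -> None:
    """Download one URL with the embedded youtube-dl API."""
    import youtube_dl  # third-party; imported lazily so the sketch loads without it

    with youtube_dl.YoutubeDL(options) as ydl:
        ydl.download([url])


# A single-file mp4 format avoids the separate audio/video download and muxing.
options = build_ydl_options("%(title)s.%(ext)s", video_format="mp4")
```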

RJFeddeler commented 4 years ago

@jturner421

The xblock has both the sequential block id of the sub-section and the vertical block id of the unit. The sequential block IDs (section/subsections) are still working as usual but you use api calls to get the vertical block IDs of the units of each section.
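To make the distinction between the two kinds of block ID concrete, here is a small sketch that pulls the course, block type, and block ID out of a URL like the one @jturner421 posted. The regular expression is inferred from that single example, so treat it as an assumption rather than a specification of the key format.

```python
import re
from typing import NamedTuple, Optional


class BlockKey(NamedTuple):
    course: str      # e.g. "BerkeleyX+Data8.1x+2T2020"
    block_type: str  # e.g. "sequential" (sub-section) or "vertical" (unit)
    block_id: str    # identifier after "block@"


# Pattern inferred from the example URL in this thread (assumption).
_BLOCK_RE = re.compile(
    r"block-v1:(?P<course>.+?)\+type@(?P<type>[^+]+)\+block@(?P<id>[0-9a-zA-Z_-]+)"
)


def parse_block_key(text: str) -> Optional[BlockKey]:
    """Return the first block key found in text, or None if there is none."""
    match = _BLOCK_RE.search(text)
    if match is None:
        return None
    return BlockKey(match.group("course"), match.group("type"), match.group("id"))


key = parse_block_key(
    "https://courses.edx.org/courses/course-v1:BerkeleyX+Data8.1x+2T2020/jump_to/"
    "block-v1:BerkeleyX+Data8.1x+2T2020+type@sequential+block@"
    "851eafb36585493aa5ce5c54f8d56d4a"
)
```

Here `key.block_type` is "sequential", so this URL identifies a sub-section; the unit IDs have type "vertical" and come from the API calls.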

shad90 commented 4 years ago

@RJFeddeler Your code is working. I had one problem: with the default settings it tries to download the YouTube videos. It downloads one video and then raises an exception while downloading the second video. I ran the script again, and it downloaded the second video and raised the exception again on the third video.

It was something about a connection timeout. It also takes a long time to download each video. EDIT: I tried again and it went fine without any problem. I will post an update on it.

Sorry, I don't have the details available now. But I used --prefer-cdn-videos and it is working flawlessly. The YouTube download method does have the advantage of naming the files properly, though.

Downloading courses before edx decides to update their system again

RJFeddeler commented 4 years ago

@shad90 I'm not sure why YouTube downloading isn't working for you. One thing you could try is adding the command line argument --ignore-errors (download the latest version for that). I mentioned above why the downloads take a long time: it's downloading the best quality audio and video separately and muxing them together. You can add the argument --format "best" and it should download from YouTube quicker.

EDIT: It is --format "mp4" (or: -f "mp4")

vobisie commented 3 years ago

I published my code, I modified edx_dl.py and parsing.py

https://github.com/RJFeddeler/edx-dl/

I decided to switch the way youtube-dl is used to the embedded method so I'm still playing around with the settings for that. It doesn't show progress or anything and downloads the best quality and muxes the audio/video together which is slower (ffmpeg is required for that, I forget if it defaults to the normal video file if not installed). I did that because I've been getting a lot of 500 errors when trying to download the default video+audio.

Hey @RJFeddeler I used your repository and got the following error message

Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "edx-dl.py", line 8, in <module>
    edx_dl.main()
  File "C:\Users\iobis\Desktop\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 1112, in main
    all_selections = {selected_course:
  File "C:\Users\iobis\Desktop\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 1113, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 232, in get_available_sections
    sections = page_extractor.extract_sections_from_html(page, BASE_URL)
  File "C:\Users\iobis\Desktop\edx-dl-master\edx-dl-master\edx_dl\parsing.py", line 457, in extract_sections_from_html
    sections = [Section(position=i,
  File "C:\Users\iobis\Desktop\edx-dl-master\edx-dl-master\edx_dl\parsing.py", line 459, in <listcomp>
    url=_make_url(section_soup),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx-dl-master\edx_dl\parsing.py", line 430, in _make_url
    return section_soup.a['href']
TypeError: 'NoneType' object is not subscriptable

RJFeddeler commented 3 years ago

@0n0n0m0uz No problem, I'm happy it worked! I am currently working on an update to make the output nicer and more useful (progress bars for everything, not just the current video) and to handle youtube errors better (downloading alternative videos)

@vobisie I didn't modify that at all, extracting the sections still works from the old code, it's getting the units from the sections that was the problem, not sure why you are getting that error.

vobisie commented 3 years ago

@RJFeddeler

@vobisie I didn't modify that at all, extracting the sections still works from the old code, it's getting the units from the sections that was the problem, not sure why you are getting that error.

(base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ***@gmail.com https://courses.edx.org/courses/course-v1:MITx+JPAL102x+3T2020/course/
edx_dl version 0.1.13
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "edx-dl.py", line 8, in <module>
    edx_dl.main()
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1112, in main
    all_selections = {selected_course:
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1113, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 232, in get_available_sections
    sections = page_extractor.extract_sections_from_html(page, BASE_URL)
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 457, in extract_sections_from_html
    sections = [Section(position=i,
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 459, in <listcomp>
    url=_make_url(section_soup),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 430, in _make_url
    return section_soup.a['href']
TypeError: 'NoneType' object is not subscriptable

The above is the complete output I get.

RJFeddeler commented 3 years ago

@vobisie I haven't looked at the section extraction code much. Your guess is as good as mine. You sure you have the right URL for the course? Do other courses work or same problem?

EDIT: Working on some code now and I see it verifies the URL is in your list before starting, so guess it's not a problem of a wrong URL.

Coperbytes commented 3 years ago

I am getting different errors after using your code.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\appdata\local\programs\python\python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\appdata\local\programs\python\python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\AppData\Local\Programs\Python\Python37\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\edx_dl\edx_dl.py", line 1165, in main
    downloadCount = download(args, selections, filtered_units, headers)
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\edx_dl\edx_dl.py", line 940, in download
    headers)
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\edx_dl\edx_dl.py", line 898, in download_unit
    headers)
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\edx_dl\edx_dl.py", line 875, in download_video
    downloadCount += skip_or_download(youtube_downloads, headers, args)
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\edx_dl\edx_dl.py", line 858, in skip_or_download
    f(url, filename, headers, args)
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\edx_dl\edx_dl.py", line 763, in download_url
    download_youtube_url(url, filename, headers, args)
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\edx_dl\edx_dl.py", line 827, in download_youtube_url
    ydl.download([url])
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\youtube_dl\YoutubeDL.py", line 2019, in download
    url, force_generic_extractor=self.params.get('force_generic_extractor', False))
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\youtube_dl\YoutubeDL.py", line 820, in extract_info
    self.report_error(compat_str(e), e.format_traceback())
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\youtube_dl\YoutubeDL.py", line 625, in report_error
    self.trouble(error_message, tb)
  File "c:\users\appdata\local\programs\python\python37\lib\site-packages\youtube_dl\YoutubeDL.py", line 595, in trouble
    raise DownloadError(message, exc_info)
youtube_dl.utils.DownloadError: ERROR: VgYkGzp_3jk: YouTube said: Unable to extract video data

RJFeddeler commented 3 years ago

@techfre

"Unable to extract video data" is a problem with either YouTube or youtube-dl. I get those somewhat frequently, so I'm in the process of detecting those errors and downloading a different version of the video (YouTube hosts multiple versions of each video with different encodings, resolutions, etc.).

You can see from the traceback that the exception comes from youtube-dl, so I can't do anything about it. I believe that if you just use the flag -i (also known as --ignore-errors), it should skip that video and continue downloading.

jturner421 commented 3 years ago

@RJFeddeler I can confirm that your code works well. Been able to download several courses. The only change I've made so far is to add downloading progress to the console. The output is not pretty, but it works.

Add bolded text to the my_hook method of the MyLogger class starting at line 115 of edx_dl.py

def my_hook(d):
    if d['status'] == 'error':
        print('Error downloading video from YouTube!')
    if d['status'] == 'downloading':
        print(d['filename'], d['_percent_str'], d['_eta_str'])
    if d['status'] == 'finished':
        file_tuple = os.path.split(os.path.abspath(d['filename']))
        print("Done downloading {}".format(file_tuple[1]))
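A slightly more defensive variant of the hook above, as a sketch: it separates formatting from printing so the output can be checked without running a download, and it uses .get() in case a key is missing from a status dict. The `format_progress` name is hypothetical; the dictionary keys are the ones used in the snippet above.

```python
import os
from typing import Any, Dict, Optional


def format_progress(d: Dict[str, Any]) -> Optional[str]:
    """Turn a youtube-dl progress dict into one console line (hypothetical helper)."""
    status = d.get("status")
    name = os.path.basename(d.get("filename", ""))
    if status == "error":
        return "Error downloading video from YouTube!"
    if status == "downloading":
        percent = d.get("_percent_str", "?").strip()
        eta = d.get("_eta_str", "?").strip()
        return "{} {} ETA {}".format(name, percent, eta)
    if status == "finished":
        return "Done downloading {}".format(name)
    return None  # unknown status: print nothing


def my_hook(d: Dict[str, Any]) -> None:
    """Progress hook to pass to youtube-dl via the 'progress_hooks' option."""
    line = format_progress(d)
    if line is not None:
        print(line)


line = format_progress({
    "status": "downloading",
    "filename": "videos/intro.mp4",
    "_percent_str": " 42.0%",
    "_eta_str": "00:10",
})
```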

RJFeddeler commented 3 years ago

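I updated my repository with my latest version. It isn't perfect, but it displays progress for the course/section/unit/video. I thought it was worth posting even though it isn't finished. It uses tqdm for progress bars. I also added an additional argument which I haven't tested:

  • -a (or --all): downloads all available courses sequentially. Do NOT specify any course urls with this arg, if you do, this arg is ignored.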

vobisie commented 3 years ago

@RJFeddeler still having the same issues. Tried with some of my other edx courses, below is the output.

(base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ***@gmail.com https://courses.edx.org/courses/course-v1:MITx+JPAL102x+3T2020/course/
edx_dl version 0.1.13
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "edx-dl.py", line 8, in <module>
    edx_dl.main()
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1183, in main
    all_selections = {selected_course:
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1184, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 285, in get_available_sections
    sections = page_extractor.extract_sections_from_html(page, BASE_URL)
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 457, in extract_sections_from_html
    sections = [Section(position=i,
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 459, in <listcomp>
    url=_make_url(section_soup),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 430, in _make_url
    return section_soup.a['href']
TypeError: 'NoneType' object is not subscriptable

(base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ***@gmail.com https://courses.edx.org/courses/course-v1:MITx+14.740x+3T2020/course/
edx_dl version 0.1.13
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "edx-dl.py", line 8, in <module>
    edx_dl.main()
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1183, in main
    all_selections = {selected_course:
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1184, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 285, in get_available_sections
    sections = page_extractor.extract_sections_from_html(page, BASE_URL)
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 457, in extract_sections_from_html
    sections = [Section(position=i,
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 459, in <listcomp>
    url=_make_url(section_soup),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 430, in _make_url
    return section_soup.a['href']
TypeError: 'NoneType' object is not subscriptable

(base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ****@gmail.com https://courses.edx.org/courses/course-v1:MITx+15.415.1x+1T2020/course/
edx_dl version 0.1.13
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "edx-dl.py", line 8, in <module>
    edx_dl.main()
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1183, in main
    all_selections = {selected_course:
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1184, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 285, in get_available_sections
    sections = page_extractor.extract_sections_from_html(page, BASE_URL)
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 457, in extract_sections_from_html
    sections = [Section(position=i,
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 459, in <listcomp>
    url=_make_url(section_soup),
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\parsing.py", line 430, in _make_url
    return section_soup.a['href']
TypeError: 'NoneType' object is not subscriptable

However, it could be an issue with MIT courses because I was able to download this course without much hassle, while I struggled previously.

(base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ***@gmail.com https://courses.edx.org/courses/course-v1:edX+edx201+1T2020/course/
edx_dl version 0.1.13
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading How to Learn Online [course-v1:edX+edx201+1T2020/co]
Section 1: Welcome Getting Started The edX Team
Section 2: Self-Care for Learning Managing Stress Memory and Learning Take Five for Yourself (1 Question)
Section 3: Space, Time and Technology Creating Space for Learning Time Management Managing Your Technology (1 Question)
Section 4: Learning Strategies Self-Regulation and Learning Durable Learning (1 Question)
Section 5: Communication and Community Learning Together Working Together (1 Question)
Section 6: What's Next? Keep Learning

Processing units...

Removed 0 duplicated urls from 24 in total
Output directory: Downloaded

Please advise and assist if possible.

weirdsourcer commented 3 years ago

I updated my repository with my latest version. It isn't perfect but it displays progress for the course/section/unit/video. I thought it was worth posting even though it isn't finished. It uses tqdm for progress bars. I also added an additional argument which I haven't tested:

  • -a (or --all): downloads all available courses sequentially. Do NOT specify any course urls with this arg, if you do, this arg is ignored.

I used edx-dl 2 months ago and it worked smoothly. I came back to it today and discovered this issue, so thanks for resolving it. However, I'm a novice with GitHub: how do I incorporate your code into the edx-dl folder on my PC? I tried pip install --upgrade edx-dl, but the output says all requirements are already satisfied, and I still only get empty folders.

Kindly help.

RJFeddeler commented 3 years ago

@weirdsourcer I'm actually not sure where pip pulls packages from. I can try to figure it out later, but for now you just have to replace two files in your edx-dl folder with the ones from my source. The two modified files are edx_dl.py and parsing.py. I'm not sure exactly where pip installs your packages, but you can type:

pip show edx-dl

to find out.
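If pip show doesn't make the location obvious, the same information can be obtained from Python itself. This is a small sketch using only the standard library; `package_dir` is a hypothetical helper name, and it has to be run with the same interpreter that edx-dl is installed into.

```python
import importlib.util
import os
from typing import Optional


def package_dir(module_name: str) -> Optional[str]:
    """Return the directory a top-level importable module lives in, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return None  # not installed for this interpreter
    return os.path.dirname(os.path.abspath(spec.origin))


# The importable package is named edx_dl (with an underscore).
# This is None if edx-dl is not installed in the current environment.
location = package_dir("edx_dl")
print(location)
```

The printed directory is where edx_dl.py and parsing.py would need to be replaced.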

weirdsourcer commented 3 years ago

I tried it exactly according to your guide, but unfortunately it still maintains its behaviour of downloading only empty folders.

I used the command below. I'm almost certain it is correct, as it is what I used to download the Microsoft courses I took 2 months ago.

(base) C:\Users\****>edx-dl -u ********@gmail.com -p ******! -o "C:\Users\********\OneDrive\Desktop\Online Course\EDX\MITx" --cache --youtube-dl-options="-f bestvideo[height<=1080]+bestaudio/best[height<=1080]" "https://courses.edx.org/courses/course-v1:MITx+15.071x+2T2020/course/"

UPDATE: the command is working now after I removed --cache from it, which makes me wonder whether it will still work if I want to continue my course download later, as MITx releases course content every week.

Update: it stops downloading after a while with the error

Removed 3 duplicated urls from 330 in total
Output directory: C:\Users******\OneDrive\Desktop\Online Course\EDX\MITx

Traceback (most recent call last):
  File "c:\users\******\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 797, in extract_info
    ie_result = ie.extract(url)
  File "c:\users\******anaconda3\lib\site-packages\youtube_dl\extractor\common.py", line 530, in extract
    ie_result = self._real_extract(url)
  File "c:\users\******\anaconda3\lib\site-packages\youtube_dl\extractor\youtube.py", line 1893, in _real_extract
    'YouTube said: %s' % unavailable_message, expected=True, video_id=video_id)
youtube_dl.utils.ExtractorError: idRDTAUV8uY: YouTube said: Unable to extract video data

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\******\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\******\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\******\anaconda3\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\******\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 1233, in main
    download(args, selections, filtered_units, headers)
  File "c:\users\******\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 1002, in download
    download_unit(unit, args, target_dir, filename_prefix, headers)
  File "c:\users\******\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 937, in download_unit
    download_video(unit.videos[0], args, target_dir, filename_prefix, headers)
  File "c:\users\******\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 920, in download_video
    skip_or_download(youtube_downloads, FileType.Video, headers, args)
  File "c:\users\******\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 909, in skip_or_download
    f(url, filename, headers, args)
  File "c:\users\******\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 827, in download_url
    download_youtube_url(url, filename, headers, args)
  File "c:\users\******\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 880, in download_youtube_url
    ydl.download([url])
  File "c:\users\w******\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 2019, in download
    url, force_generic_extractor=self.params.get('force_generic_extractor', False))
  File "c:\users\******\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 820, in extract_info
    self.report_error(compat_str(e), e.format_traceback())
  File "c:\users\******\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 625, in report_error
    self.trouble(error_message, tb)
  File "c:\users\******\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 595, in trouble
    raise DownloadError(message, exc_info)
youtube_dl.utils.DownloadError: ERROR: idRDTAUV8uY: YouTube said: Unable to extract video data

txeni commented 3 years ago

@RJFeddeler You are the man, thanks so much!

Just in case someone else runs into the same problem I did: for one course, downloading the separate audio and video files and then merging them was giving me problems. If anyone has a similar problem, with errors from youtube-dl or ffmpeg, try the -f "mp4" argument. It solved the problem for me. Before that I was getting the following error:

Traceback (most recent call last):
  File "/home/carlos/anaconda2/lib/python3.7/site-packages/youtube_dl/YoutubeDL.py", line 2065, in post_process
    files_to_delete, info = pp.run(info)
  File "/home/carlos/anaconda2/lib/python3.7/site-packages/youtube_dl/postprocessor/ffmpeg.py", line 523, in run
    self.run_ffmpeg_multiple_files(info['__files_to_merge'], temp_filename, args)
  File "/home/carlos/anaconda2/lib/python3.7/site-packages/youtube_dl/postprocessor/ffmpeg.py", line 235, in run_ffmpeg_multiple_files
    raise FFmpegPostProcessorError(msg)
youtube_dl.postprocessor.ffmpeg.FFmpegPostProcessorError: Could not write header for output file #0 (incorrect codec parameters ?): Invalid argument

RJFeddeler commented 3 years ago

@weirdsourcer --cache isn't very well implemented (by the original authors, I didn't touch it) and it really only saves you like a minute in time. You can resume downloading a course each week just by downloading the course without the --cache argument. It is supposed to skip files already downloaded, which it does for everything but youtube downloads which I'm currently working on fixing. The original code relies on youtube-dl to skip the youtube download which works okay but wastes some time.

As for the error you get now, where it stops downloading: that is something that has always happened for me. That's why I always use the --ignore-errors (or just -i) argument. At least then, when it encounters that error, it will keep going. As for a better solution, I am having it download alternate versions of a video if one fails like that. I haven't tested it, but it should be working; I'll publish the new code soon. You need to use the --ignore-errors argument, though, or any error youtube-dl encounters will simply end the program.

@txeni Yea I've had it default to downloading the best quality audio and video separately and using ffmpeg to combine them but I think it makes more sense to use the standard mp4 format argument as the default. I was just having trouble with the original code with the same error @weirdsourcer was having so I was looking for a format source that was more reliable but none are. I'll decide on a format order to go through when errors are encountered so that it doesn't just skip the file, but --ignore-errors must be specified.
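The "format order to go through when errors are encountered" idea could look roughly like this. It is a sketch of the general technique, not the code in the fork: `download_with_fallback` and `fake_download` are hypothetical names, and the download step is passed in as a callable so the logic can be tried without touching the network. In practice the callable would wrap youtube-dl, and the except clause could be narrowed to youtube_dl.utils.DownloadError.

```python
from typing import Callable, Iterable, Optional


def download_with_fallback(url: str,
                           formats: Iterable[str],
                           download: Callable[[str, str], None]) -> Optional[str]:
    """Try each format in order; return the first one that works, else None."""
    for fmt in formats:
        try:
            download(url, fmt)
        except Exception as exc:  # broad on purpose in this sketch
            print("format {!r} failed for {}: {}".format(fmt, url, exc))
            continue
        return fmt
    return None


def fake_download(url: str, fmt: str) -> None:
    """Stand-in for a real download: only the 'mp4' format succeeds."""
    if fmt != "mp4":
        raise RuntimeError("Unable to extract video data")


chosen = download_with_fallback(
    "https://example.invalid/video",
    ["bestvideo+bestaudio", "mp4", "best"],
    fake_download,
)
```

With this ordering the higher-quality muxed download is attempted first, and the single-file mp4 is used only when that fails.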

nalam002 commented 3 years ago

@RJFeddeler Thanks for your new code, however I always keep getting a traceback that seems different from the ones reported so far, even if I'm typing nothing but edx-dl without any arguments at all.

Traceback (most recent call last):
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\teacher.DESKTOP-6R84B69\miniconda3\Scripts\edx-dl.exe\__main__.py", line 4, in <module>
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\edx_dl\edx_dl.py", line 24, in <module>
    import youtube_dl
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\youtube_dl\__init__.py", line 15, in <module>
    from .options import (
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\youtube_dl\options.py", line 8, in <module>
    from .downloader.external import list_external_downloaders
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\youtube_dl\downloader\__init__.py", line 5, in <module>
    from .hls import HlsFD
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\youtube_dl\downloader\hls.py", line 6, in <module>
    from Crypto.Cipher import AES
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\Crypto\Cipher\__init__.py", line 27, in <module>
    from Crypto.Cipher._mode_ecb import _create_ecb_cipher
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\Crypto\Cipher\_mode_ecb.py", line 35, in <module>
    raw_ecb_lib = load_pycryptodome_raw_lib("Crypto.Cipher._raw_ecb", """
  File "c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\Crypto\Util\_raw_api.py", line 308, in load_pycryptodome_raw_lib
    raise OSError("Cannot load native module '%s': %s" % (name, ", ".join(attempts)))
OSError: Cannot load native module 'Crypto.Cipher._raw_ecb': Trying '_raw_ecb.cp38-win_amd64.pyd': cannot load library 'c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\Crypto\Util\..\Cipher\_raw_ecb.cp38-win_amd64.pyd': error 0x7e.  Additionally, ctypes.util.find_library() did not manage to locate a library called 'c:\\users\\teacher.desktop-6r84b69\\miniconda3\\lib\\site-packages\\Crypto\\Util\\..\\Cipher\\_raw_ecb.cp38-win_amd64.pyd', Trying '_raw_ecb.pyd': cannot load library 'c:\users\teacher.desktop-6r84b69\miniconda3\lib\site-packages\Crypto\Util\..\Cipher\_raw_ecb.pyd': error 0x7e.  Additionally, ctypes.util.find_library() did not manage to locate a library called 'c:\\users\\teacher.desktop-6r84b69\\miniconda3\\lib\\site-packages\\Crypto\\Util\\..\\Cipher\\_raw_ecb.pyd'

I upgraded youtube-dl and Crypto packages just to be safe, but nothing changed. :( BTW I'm running python 3.8 if that means anything.

EDIT: Nvm, I reinstalled python (latest miniconda) and redid everything, and now the error is gone.

weirdsourcer commented 3 years ago

@RJFeddeler Thanks for your effort. Is it possible with your latest commit to download a list of courses with just one request? I noticed the progress bar counts the course as 0/1. What is the command for downloading, say, a whole programme with 3 courses in it, that is, 3 courses at the same time without sending a separate request per course? This should result in a course count like 0/3, 1/3, 2/3; you get the idea.

PencilWarrior1 commented 3 years ago

@RJFeddeler I installed your version, but when I run edx-dl I'm getting this error... any ideas? :-)

"C:\Users*****\AppData\Local\Programs\Python\Python38-32\lib\site-packages\edx_dl-0.1.13-py3.8.egg\edx_dl\edx_dl.py", line 27, in <module>
ModuleNotFoundError: No module named 'tqdm'

ugur1yildiz commented 3 years ago

@PencilWarrior1 Install 'tqdm' as

pip install tqdm

okyere commented 3 years ago

I updated my repository with my latest version. It isn't perfect but it displays progress for the course/section/unit/video. I thought it was worth posting even though it isn't finished. It uses tqdm for progress bars. I also added an additional argument which I haven't tested:

  • -a (or --all): downloads all available courses sequentially. Do NOT specify any course urls with this arg, if you do, this arg is ignored.

Great work. Thanks for sharing.

diamneth commented 3 years ago

@RJFeddeler Hello, I appreciate all your work and effort on this. Have you found a solution for the YouTube "Unable to extract video data" error? I am getting the same error as some of the people above.

Traceback (most recent call last):
  File "c:\users\r\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 797, in extract_info
    ie_result = ie.extract(url)
  File "c:\users\r\anaconda3\lib\site-packages\youtube_dl\extractor\common.py", line 532, in extract
    ie_result = self._real_extract(url)
  File "c:\users\r\anaconda3\lib\site-packages\youtube_dl\extractor\youtube.py", line 1909, in _real_extract
    raise ExtractorError(
youtube_dl.utils.ExtractorError: MjTmGAJCviA: YouTube said: Unable to extract video data

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\r\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\r\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\r\anaconda3\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\r\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 1236, in main
    download(args, selections, filtered_units, headers)
  File "c:\users\r\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 1004, in download
    download_unit(unit, args, target_dir, filename_prefix, headers)
  File "c:\users\r\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 939, in download_unit
    download_video(unit.videos[0], args, target_dir, filename_prefix, headers)
  File "c:\users\r\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 922, in download_video
    skip_or_download(youtube_downloads, FileType.Video, headers, args)
  File "c:\users\r\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 911, in skip_or_download
    f(url, filename, headers, args)
  File "c:\users\r\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 829, in download_url
    download_youtube_url(url, filename, headers, args)
  File "c:\users\r\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 882, in download_youtube_url
    ydl.download([url])
  File "c:\users\r\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 2018, in download
    res = self.extract_info(
  File "c:\users\r\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 820, in extract_info
    self.report_error(compat_str(e), e.format_traceback())
  File "c:\users\r\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 625, in report_error
    self.trouble(error_message, tb)
  File "c:\users\r\anaconda3\lib\site-packages\youtube_dl\YoutubeDL.py", line 595, in trouble
    raise DownloadError(message, exc_info)
youtube_dl.utils.DownloadError: ERROR: MjTmGAJCviA: YouTube said: Unable to extract video data

vobisie commented 3 years ago

Does anyone have any potential solutions to resolve this issue?

(base) C:\Users*\Desktop\edx-dl-master>python edx-dl.py -u **@gmail.com https://courses.edx.org/courses/course-v1:MITx+14.740x+3T2020/course/
edx_dl version 0.1.13
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "edx-dl.py", line 8, in <module>
    edx_dl.main()
  File "C:\Users\\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1183, in main
    all_selections = {selected_course:
  File "C:\Users*\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1184, in
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "C:\Users*\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 285, in get_available_sections
    sections = page_extractor.extract_sections_from_html(page, BASE_URL)
  File "C:\Users*\Desktop\edx-dl-master\edx_dl\parsing.py", line 457, in extract_sections_from_html
    sections = [Section(position=i,
  File "C:\Users*\Desktop\edx-dl-master\edx_dl\parsing.py", line 459, in
    url=_make_url(section_soup),
  File "C:\Users***\Desktop\edx-dl-master\edx_dl\parsing.py", line 430, in _make_url
    return section_soup.a['href']
TypeError: 'NoneType' object is not subscriptable

RJFeddeler commented 3 years ago

@weirdsourcer you can list multiple course urls or you can list no course urls and use the -a or --all flag to download all available courses.

@diamneth use the --ignore-errors flag. My latest code will attempt to download the video again in a different format, and if that fails it will at least continue downloading the rest of the videos. I'm guessing that error is a problem with youtube-dl; I've always gotten that error randomly.

sorin71 commented 3 years ago

@vobisie I had the same problem. I fixed it in parsing.py by changing the line 431 from: except AttributeError: to except:
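
To illustrate why that change helps: the traceback above ends in a TypeError because section_soup.a is None, and an except AttributeError: clause does not catch a TypeError. A narrower alternative to a bare except: would be to catch both exception types. The sketch below is only a self-contained illustration; FakeSection and make_url_or_none are hypothetical names, not code from edx-dl, and the real _make_url may look different.

```python
class FakeSection:
    """Hypothetical stand-in for a parsed section; `a` mimics the <a> tag lookup."""

    def __init__(self, a=None):
        self.a = a


def make_url_or_none(section_soup):
    """Return the section link, or None when the section has no usable <a> tag."""
    try:
        return section_soup.a['href']
    except (AttributeError, TypeError):
        # TypeError: section_soup.a is None, so None['href'] fails.
        # AttributeError: section_soup has no attribute `a` at all.
        return None


print(make_url_or_none(FakeSection()))
print(make_url_or_none(FakeSection({'href': '/courses/demo'})))
```

Catching the two specific exceptions keeps unrelated errors (for example a KeyboardInterrupt) from being silently swallowed, which a bare except: would do.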

vobisie commented 3 years ago

Thank you @sorin71 . Do you have any idea how to fix the issue of there being no sound with the videos downloaded? Thank you

vobisie commented 3 years ago

Also, does the takedown of youtube-dl impact edx-dl? Can this work with youtube-dlc? If so, how?

Thank you

sorin71 commented 3 years ago

The takedown of youtube-dl will have an impact on edx-dl, but probably only in the longer term, when YouTube makes format changes. youtube-dlc might end up being taken down as well, since it seems to be a fork of youtube-dl.

The problem with no sound in the downloaded videos is not a real one. The video (mp4) and the audio (m4a) are in separate files, and you have to combine them into a single file using a tool like ffmpeg.
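
As a rough sketch of that merge step, the snippet below builds an ffmpeg command that takes the video file and the audio file as two inputs and copies both streams into one output without re-encoding. It assumes ffmpeg is installed and on the PATH; the file names and the helper names build_merge_command and merge are hypothetical, chosen only for illustration.

```python
import subprocess


def build_merge_command(video_path, audio_path, output_path):
    """Build an ffmpeg command that muxes one video file and one audio file."""
    return [
        'ffmpeg',
        '-i', video_path,   # first input: the video-only mp4
        '-i', audio_path,   # second input: the audio-only m4a
        '-c', 'copy',       # copy both streams as-is, no re-encoding
        output_path,
    ]


def merge(video_path, audio_path, output_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_merge_command(video_path, audio_path, output_path), check=True)


# Print the command for one hypothetical pair of files instead of running it.
print(' '.join(build_merge_command('lecture.mp4', 'lecture.m4a', 'lecture_merged.mp4')))
```

Calling merge('lecture.mp4', 'lecture.m4a', 'lecture_merged.mp4') would run the printed command; the output name should differ from both inputs.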

jmfontana commented 3 years ago

Same Error 403 problem:

  File "/Users/blahuser/.pyenv/versions/3.8.3/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

MissGorgeousTech commented 3 years ago

it is working great what I did was pip uninstall edx-dl (the original) upgraded youtube-dl I also have the python version 3.8 then download your zip code...unzipped from cmd changed the directory to the unzipped folder then run python edx-dl.py -u xxxxx@xxxxx.com --list-courses [if the error ModuleNotFoundError: No module named 'tqdm' ---you do : pip install tqdm and then try again] then choose the URL from the course

note the something I noted is sometimes some videos are separated from track audio but nevertheless works great

vobisie commented 3 years ago

@sorin71 & @RJFeddeler do you have any idea why I get this output for python edx-dl -u *** -a -i

Traceback (most recent call last):
  File "edx-dl.py", line 8, in <module>
    edx_dl.main()
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 1213, in main
    all_units = extractor(all_urls, headers, file_formats)
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 590, in extract_all_units_in_parallel
    units = pool.map(mapfunc, urls)
  File "C:\Users\iobis\anaconda3\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\iobis\anaconda3\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
  File "C:\Users\iobis\anaconda3\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\iobis\anaconda3\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\edx_dl.py", line 559, in extract_units
    unit_page = get_page_contents(unit_url, headers)
  File "C:\Users\iobis\Desktop\edx-dl-master\edx_dl\utils.py", line 58, in get_page_contents
    result = urlopen(Request(url, None, headers))
  File "C:\Users\iobis\anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\iobis\anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\iobis\anaconda3\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\iobis\anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\iobis\anaconda3\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\iobis\anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

jmfontana commented 3 years ago

it is working great what I did was pip uninstall edx-dl (the original) upgraded youtube-dl I also have the python version 3.8 then download your zip code...unzipped from cmd changed the directory to the unzipped folder then run python edx-dl.py -u xxxxx@xxxxx.com --list-courses [if the error ModuleNotFoundError: No module named 'tqdm' ---you do : pip install tqdm and then try again] then choose the URL from the course

note the something I noted is sometimes some videos are separated from track audio but nevertheless works great

This is it! Yes, this worked for me. I had tried everything else but until I did 'pip uninstall edx-dl' nothing worked. Thanks!

JM

MagTun commented 3 years ago

@jmfontana and @MissGorgeousTech, what do you mean by "zip code" in:

then download your zip code...unzipped

Is it possible to have the link? Thanks !

MissGorgeousTech commented 3 years ago

@jmfontana and @MissGorgeousTech, what do you mean by "zip code" in:

then download your zip code...unzipped

Is it possible to have the link? Thanks!

Hi. I mean downloading the zipped code: look for the green button that says Code, click on it, and you will see Download ZIP. Click that and go from there. If you still have any difficulties, feel free to tell me and I will try with screenshots.

MagTun commented 3 years ago

Thanks for your help @MissGorgeousTech ! I found the green button on the home page but I am still getting empty folders.

4 days ago, I was able to get some videos by following @RJFeddeler's code, but at some point during the download I got an error: RuntimeError: cannot join current thread. When I tried again, the script stayed on "Processing" for hours. The first time I got 13 videos; the second time I tried again from scratch and got 21 videos (I guess there are over 100 videos in my course: I only got the review videos, and the download didn't even reach week 1 of a 5-week course).

I am on python 3.6 (can't update yet to 3.8)

weirdsourcer commented 3 years ago

Could the youtube-dl takedown be the culprit for the following error? Is there a way to resolve it if so? I'm on Python 3.8.

Traceback (most recent call last):
  File "c:\users\user\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\user\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\USER\anaconda3\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\user\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 1213, in main
    all_units = extractor(all_urls, headers, file_formats)
  File "c:\users\user\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 590, in extract_all_units_in_parallel
    units = pool.map(mapfunc, urls)
  File "c:\users\user\anaconda3\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "c:\users\user\anaconda3\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
  File "c:\users\user\anaconda3\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\user\anaconda3\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "c:\users\user\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 553, in extract_units
    page = get_page_contents(url, headers)
  File "c:\users\user\anaconda3\lib\site-packages\edx_dl\utils.py", line 58, in get_page_contents
    result = urlopen(Request(url, None, headers))
  File "c:\users\user\anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "c:\users\user\anaconda3\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "c:\users\user\anaconda3\lib\urllib\request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "c:\users\user\anaconda3\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "c:\users\user\anaconda3\lib\urllib\request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "c:\users\user\anaconda3\lib\urllib\request.py", line 1354, in do_open
    r = h.getresponse()
  File "c:\users\user\anaconda3\lib\http\client.py", line 1332, in getresponse
    response.begin()
  File "c:\users\user\anaconda3\lib\http\client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "c:\users\user\anaconda3\lib\http\client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

MagTun commented 3 years ago

After seeing that youtube-dl was updated, I tried again yesterday, and I was able to get all the videos from my courses. First I updated youtube-dl with:

python -m pip install --upgrade youtube-dl

Then I made sure that edx-dl was also up to date:

python -m pip install --upgrade edx-dl

Then I replaced the files edx-dl.py and parsing.py according to @RJFeddeler's comment.

Thanks for your help!

Oscarhg42 commented 3 years ago

@RJFeddeler you rock!! Thanks for sharing your repository!

MATRIX30 commented 3 years ago

🚨Please review the Troubleshooting section before reporting any issue. Don't forget also to check the current issues to avoid duplicates.

Subject of the issue

After modifying my edx.py and parsing.py as prescribed by @RJFeddeler, I still get this error. Can someone figure out what's wrong?

Your environment

Steps to reproduce

edx-dl -u email -p password --ignore-errors --cache https://courses.edx.org/courses/course-v1:USMx+ENCE607.1x+3T2019/course/

Expected behaviour

download should have started normally

Actual behaviour

I get this Error message

Building initial headers for future requests. Getting initial CSRF token. Found CSRF token. Logging into Open edX site: https://courses.edx.org/login_ajax Extracting course information from dashboard. Downloading Applied Scrum for Agile Project Management [course-v1:USMx+ENCE607.1x+3T2019/co] Section 1: Welcome! Welcome to Applied Scrum Getting Started with Goals! Section 2: Week 1: Why Agile? 1.0 Introduction to Week 1 1.1 Agile Basics 1.2 Proof Agile Works 1.3 Evolution of Agile 1.4 Netflix Case Study 1.5 18F Case Study 1.6 Week 1 Quiz 1.7 Week 1 Takeaways & Feedback Verify Your Knowledge and Skills! Section 3: Week 2: Who Uses Agile? 2.0 Introduction to Week 2 2.1 Simple PM Methods 2.2 Approaching the Triple Cost Constraint 2.3 Comparing Methods Across Industries 2.4 Comparing Methods of Customer Management 2.5 Comparing Methods of Engineering Management 2.6 Week 2 Quiz 2.7 Week 2 Takeaways & Feedback Verify Your Knowledge and Skills! Section 4: Week 3: How to Scrum And Be Agile? 3.0 Introduction to How to Scrum and Be Agile? 3.1 Scrum Team Formation 3.2 Three-Part User Story 3.3 Sprint Planning 3.4 Sprint Development 3.5 Sprint Retro & Review 3.6 Week 3 Quiz 3.7 Week 3 Takeaways & Feedback Verify Your Knowledge and Skills! Section 5: Week 4: What Scrum Framework Fits Best? 4.0 Introduction to What Scrum Framework Fits Best? 4.1 Scrum in the World of Agile 4.2 Exploring the Scaled Agile Framework (SAFe) 4.3 Exploring Disciplined Agile Delivery (DAD) 4.4 Exploring Large Scale Scrum (LeSS) 4.5 Pitfalls and Benefits of Agile at Scale 4.6 Week 4 Quiz 4.7 Week 4 Takeaways & Feedback Verify Your Knowledge and Skills! Section 6: Course Final for Verified Students Course Final for Verified Students Section 7: Congratulations! Now Keep Going! Thank You! Now Will You Continue? Feedback Quiz Processing units...

Removed 0 duplicated urls from 76 in total

edx_dl version 0.1.13
loading 3212 urls from cache [edx-dl.cache]
Traceback (most recent call last):
  File "c:\users\cyanide systems\appdata\local\programs\python\python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\cyanide systems\appdata\local\programs\python\python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Cyanide Systems\AppData\Local\Programs\Python\Python39\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\cyanide systems\appdata\local\programs\python\python39\lib\site-packages\edx_dl\edx_dl.py", line 1233, in main
    download(args, selections, filtered_units, headers)
  File "c:\users\cyanide systems\appdata\local\programs\python\python39\lib\site-packages\edx_dl\edx_dl.py", line 989, in download
    coursename = directory_name(selected_course.name)
  File "c:\users\cyanide systems\appdata\local\programs\python\python39\lib\site-packages\edx_dl\utils.py", line 49, in directory_name
    result = clean_filename(initial_name)
  File "c:\users\cyanide systems\appdata\local\programs\python\python39\lib\site-packages\edx_dl\utils.py", line 123, in clean_filename
    s = h.unescape(s)
AttributeError: 'HTMLParser' object has no attribute 'unescape'
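
A note on the final AttributeError: the unescape method of HTMLParser was removed in Python 3.9, and the paths in this traceback point at a Python 3.9 install. The standard library function html.unescape does the same job. A minimal sketch of that replacement follows; unescape_name is a hypothetical helper, and since the body of clean_filename in utils.py is not shown in this thread, where exactly the change belongs is an assumption.

```python
import html


def unescape_name(s):
    """Decode HTML entities such as &amp; in a course or file name."""
    # html.unescape is the standard-library replacement for HTMLParser().unescape
    return html.unescape(s)


print(unescape_name('Takeaways &amp; Feedback'))
```

Running edx-dl under Python 3.8, as several people above do, would probably also avoid this particular error.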