coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

[Feature Proposal] Download all content #600

Open · RayBB opened this issue 4 years ago

RayBB commented 4 years ago

[Feature Request] Download all content

I am making a new issue to have a single, complete ticket to point to that will easily be found when people wonder why they can't download quizzes, notes, assignments, assessments, handout sheets, knowledge checks, questions, or HTML content in general.

It seems that everyone in all these comments is on the same page: downloading content besides videos/PDFs is very valuable. However, it's tricky to implement. Part of that trickiness seems to be due to the variety of course structures (see #102), though that may have improved since 2014.

These OPEN tickets are related:

#102 -

#253 - asks to download pages with embedded HTML

#283 - has a lot of discussion of the download hierarchy

#337 - using edx-platform-api (seems unlikely)

#429 - includes a patch that used to work

#447 -

#524 -

#550 -

#561 -

#596 -

I'd recommend closing some of those tickets, since anyone following them will see the link to this new ticket.

Temporary Fix

As a stopgap measure I've written a very small JS script that folks can run in the browser to save the content they want: https://github.com/RayBB/edx-scrape

Implementation

To get this set up in edx-dl, it looks like we'll need to determine a few things.

  1. What to download
  2. Directory structure
  3. Usage

What to Download

For a very naive solution, we could start by making something similar to the script I added above. That would mean going to the "progress" page, grabbing all of the links there, and then downloading each of those pages.

That would at least save the text content.
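A minimal Python sketch of that naive approach, as a concrete starting point. It assumes an already-authenticated `requests.Session`; the `BASE_URL`, `COURSE_ID`, and the `/jump_to/` link pattern are hypothetical stand-ins, not edx-dl internals.

```python
import os
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical values for illustration; edx-dl would supply the real
# authenticated session and course id.
BASE_URL = "https://courses.edx.org"
COURSE_ID = "course-v1:SomeOrg+CS101+2020"

session = requests.Session()  # assumed to already hold login cookies


def download_course_pages(out_dir="pages"):
    """Save the HTML of every unit linked from the course 'progress' page."""
    progress = session.get(f"{BASE_URL}/courses/{COURSE_ID}/progress")
    soup = BeautifulSoup(progress.text, "html.parser")

    os.makedirs(out_dir, exist_ok=True)
    for link in soup.select("a[href*='/jump_to/']"):  # assumed link pattern
        url = requests.compat.urljoin(BASE_URL, link["href"])
        page = session.get(url)
        # Build a safe file name from the last path component of the URL.
        name = re.sub(r"[^\w.-]", "_", url.rstrip("/").rsplit("/", 1)[-1])
        with open(os.path.join(out_dir, name + ".html"), "w", encoding="utf-8") as f:
            f.write(page.text)
```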

The next thing to think about is images and any other content embedded in the HTML and hosted on edX. The main thing I'm aware of is images, but there may be other requirements; please let me know if you know of any.

If we follow the simple solution of downloading the HTML as above, we may also want to scrape some of the JS files that load on the page so that people can still view pages offline.
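A rough sketch of one way to handle that, assuming pages were already saved by the step above; the session and the tag/attribute pairs are illustrative assumptions.

```python
import os

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # assumed to already hold login cookies


def localize_assets(html_path, base_url, asset_dir="assets"):
    """Download the images and scripts a saved page references and point the
    HTML at the local copies so the page still renders offline."""
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    os.makedirs(asset_dir, exist_ok=True)
    for tag, attr in (("img", "src"), ("script", "src")):
        for node in soup.find_all(tag, **{attr: True}):
            url = requests.compat.urljoin(base_url, node[attr])
            local = os.path.join(asset_dir, os.path.basename(url.split("?")[0]))
            with open(local, "wb") as out:
                out.write(session.get(url).content)
            node[attr] = local  # rewrite the reference to the local copy

    with open(html_path, "w", encoding="utf-8") as f:
        f.write(str(soup))
```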

Ideal solution

The ideal solution would be to parse the HTML of each specific page, grab the text and input-box values, and deal with the many other possible formats. However, that would be a lot more work.
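As a small illustration of that direction (nowhere near a full solution), extracting the text and input values from a page might start out like this:

```python
from bs4 import BeautifulSoup


def extract_page_content(html):
    """Pull visible text and any pre-filled input values out of a unit page.
    Real course pages would need per-format handling on top of this."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements before extracting text

    text = soup.get_text("\n", strip=True)
    inputs = {
        node.get("name", node.get("id", "?")): node.get("value", "")
        for node in soup.find_all("input")
    }
    return text, inputs
```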

Directory structure

I don't have strong feelings here. I think a simple "pages" folder would suffice, but if others have ideas on how to make it better, that would be great.
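Purely for illustration, one possible refinement of a flat "pages" folder would be to mirror the course hierarchy inside it; every name below is made up.

```python
from pathlib import Path


def page_path(course_dir, section, subsection, unit):
    """Hypothetical layout: a 'pages' folder mirroring the course hierarchy."""
    return Path(course_dir) / "pages" / section / subsection / f"{unit}.html"


# page_path("CS101", "week-1", "lesson-2", "unit-3")
# -> CS101/pages/week-1/lesson-2/unit-3.html
```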

Usage

It would be nice if downloading all pages were the default, but we'll probably also want to add a flag to disable it.
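If downloading pages did become the default, a hypothetical opt-out flag (the name is invented here, not an existing edx-dl option) might look like this with argparse:

```python
import argparse

parser = argparse.ArgumentParser(prog="edx-dl")
parser.add_argument(
    "--skip-pages",  # hypothetical flag name
    action="store_true",
    help="do not download HTML pages (they are downloaded by default)",
)

args = parser.parse_args(["--skip-pages"])
print(args.skip_pages)  # True
```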

Final Thoughts

Thank you so much to the developers of this project, who have already made a really fantastic tool! I know implementing this isn't easy, and I'd be willing to help out where I can, but the first step is deciding how and what needs to be done. I'm trying my best to contribute by putting this together and rounding up the above tickets so they can be closed out.

Please let me know how you all would like to proceed.

JohnVeness commented 4 years ago

You might want to take notice of https://github.com/EugeneLoy/edx-archive