[Feature Request] Download all content

I am opening a new issue to serve as one complete ticket that people can easily find when they wonder why they can't download quizzes, notes, assignments, assessments, handout sheets, knowledge checks, questions, or HTML content in general.
Everyone in all of these comments seems to be on the same page: downloading content besides videos/PDFs is very valuable, but it's tricky to implement. Part of that trickiness seems to come from the variety of course structures (see #102), though that may have improved since 2014.
These OPEN tickets are related:
#102 -
#253 - asks to download pages with embedded HTML
#283 - has a lot of discussion of download hierarchy
#337 - using edx-platform-api (seems unlikely)
#429 - includes a patch that used to work
#447 -
#524 -
#550 -
#561 -
#596 -
I'd recommend closing some of those tickets, since they'll all carry a link back to this new ticket.
Temporary Fix
As a stopgap measure, I've written a very small JS script folks can run in the browser to save the content they want: https://github.com/RayBB/edx-scrape
Implementation
To get this set up in edx-dl, it looks like we'll need to determine a few things:
- What to download
- Directory structure
- Usage
What to Download
For a very naive solution, we could start by making something similar to the script I linked above. That would mean going to the "progress" page, grabbing all of the links there, and then downloading all of those pages.
That would at least save the text content.
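A minimal sketch of that naive flow in Python with requests and BeautifulSoup; the progress URL shape and the "/courseware/" link filter are assumptions on my part, not anything edx-dl already provides:

```python
# Sketch only: the URL pattern and link filter are guesses, and the
# session would need real edX login cookies. Not edx-dl's actual API.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PROGRESS_URL = "https://courses.edx.org/courses/<course-id>/progress"  # hypothetical
session = requests.Session()  # assume cookies populated from an authenticated login

resp = session.get(PROGRESS_URL)
soup = BeautifulSoup(resp.text, "html.parser")

# Gather every link on the progress page that points into the courseware.
unit_links = sorted({a["href"] for a in soup.find_all("a", href=True)
                     if "/courseware/" in a["href"]})

os.makedirs("pages", exist_ok=True)
for i, link in enumerate(unit_links):
    page = session.get(urljoin(PROGRESS_URL, link))
    with open(os.path.join("pages", f"page_{i:03d}.html"), "w", encoding="utf-8") as f:
        f.write(page.text)
```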
The next thing to think about is images and other assets referenced by the HTML hosted on edX. Images are the main thing I'm aware of, but there may be other requirements; please let me know if you know of any.
If we follow the simple solution of downloading the HTML as above, we may also want to scrape some of the JS files that load on the page so that people can still view pages offline.
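For example, here is a hedged sketch of localizing just the images referenced by one saved page; the asset folder name and the link-rewriting scheme are assumptions, and the same idea would extend to the JS/CSS files mentioned above:

```python
# Sketch: download the images a saved page references and rewrite the
# src attributes to local paths so the page still renders offline.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def localize_images(html, base_url, session, asset_dir="pages/assets"):
    os.makedirs(asset_dir, exist_ok=True)
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img", src=True):
        url = urljoin(base_url, img["src"])
        filename = os.path.basename(urlparse(url).path) or "asset"
        local_path = os.path.join(asset_dir, filename)
        if not os.path.exists(local_path):
            with open(local_path, "wb") as f:
                f.write(session.get(url).content)
        # Point the page at the local copy instead of the edX host.
        img["src"] = os.path.relpath(local_path, "pages")
    return str(soup)
```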
Ideal Solution
The ideal solution would be to parse the HTML of each specific page and grab the text, input box values, and so on, dealing with the many other possible formats. However, that seems like a lot more work.
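To make that concrete, a small sketch of what per-page parsing might look like, covering only text and input values out of the many formats involved:

```python
# Sketch of the "ideal" direction: parse each page instead of saving
# raw HTML, pulling out readable text plus any values in input boxes.
from bs4 import BeautifulSoup

def extract_content(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts/styles so get_text() returns only visible prose.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    # Saved answers often live in <input value="..."> or <textarea>.
    inputs = [inp.get("value", "") for inp in soup.find_all("input")]
    textareas = [ta.get_text() for ta in soup.find_all("textarea")]
    return {"text": text, "inputs": inputs, "textareas": textareas}
```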
Directory Structure
I don't have strong feelings here. I think a simple "pages" folder would suffice, but if others have ideas on how to make it better, that would be great.
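For illustration only, one way the naming inside that folder could work; the slugging scheme is just a strawman:

```python
# Hypothetical layout: a flat "pages" folder with filenames derived
# from unit titles plus an index to preserve course order.
import re

def page_path(unit_title, index):
    slug = re.sub(r"[^a-z0-9]+", "-", unit_title.lower()).strip("-")
    return f"pages/{index:03d}-{slug}.html"

# page_path("Week 1: Knowledge Check", 4) -> "pages/004-week-1-knowledge-check.html"
```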
Usage
It would be nice if downloading all pages were the default, but we'll probably also want a flag to disable it.
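Something along these lines, where the flag name --skip-pages is purely hypothetical and edx-dl's real option handling may differ:

```python
# Hypothetical flag wiring; illustrative only.
import argparse

parser = argparse.ArgumentParser(prog="edx-dl")
parser.add_argument("--skip-pages", action="store_true",
                    help="do not download HTML pages (downloaded by default)")
args = parser.parse_args()
if not args.skip_pages:
    pass  # download_all_pages(...) would run here
```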
Final Thoughts
Thank you so much to the developers of this project, who have already made a really fantastic tool! I know implementing this isn't easy, and I'd be willing to help out where I can, but the first step is deciding how and what needs to be done. I'm trying my best to contribute by putting this together and rounding up the above tickets so they can be closed out.
Please let me know how you all would like to proceed.