coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

save all webpages for an EDX course #429

Open MATRIX30 opened 7 years ago

MATRIX30 commented 7 years ago

Subject of the issue

I would be grateful if someone could add an additional command-line argument, say --pg, which would enable edx-dl to save all the webpages for a particular edX course in a separate folder within the course folder. I really need someone to help me with this, because where I come from internet connections are not really stable and I need to be able to view these materials independent of internet connectivity. Thanks in advance.

jcline-ieee commented 6 years ago

I have partially implemented this; a patch (for 0.1.6) is included below. The "text comments" that normally appear below the video on a web page are now saved for each course Unit into a local per-Section web page named "Notes-[Section title].html". This is very handy for seeing the instructor comments, sometimes homework tips, and also where other resources (like the PDFs) are used. The output has limitations and isn't super pretty, but the edX course text is there. Embedded graphics from the page are not downloaded, so that is still a todo (maybe they should be added to the 'resources' download list). The result is a tree of files like this:

01-Important_Pre-Course_Survey/
02-Contact_Us/
03-How_To_Navigate_the_Course/
04-Discussion_Board/
05-Office_Hours/
06-Week_1-__Introduction_to_Data/
07-Week_2-__Univariate_Descriptive_Statistics/
08-Week_3-__Bivariate_Distributions/
09-Week_4-__Bivariate_Distributions_Categorical_Data/
10-Week_5-__Linear_Functions/
11-Week_6-__Exponential_and_Logistic_Function_Models/
12-Important_Post-Course_Survey/
Notes-01-Important_Pre-Course_Survey.html
Notes-02-Contact_Us.html
Notes-03-How_To_Navigate_the_Course.html
Notes-04-Discussion_Board.html
Notes-05-Office_Hours.html
Notes-06-Week_1-__Introduction_to_Data.html
Notes-07-Week_2-__Univariate_Descriptive_Statistics.html
Notes-08-Week_3-__Bivariate_Distributions.html
Notes-09-Week_4-__Bivariate_Distributions_Categorical_Data.html
Notes-10-Week_5-__Linear_Functions.html
Notes-11-Week_6-__Exponential_and_Logistic_Function_Models.html
Notes-12-Important_Post-Course_Survey.html 

I tested on macOS with Python 3.6 on a few small courses and a couple of big ones, such as https://courses.edx.org/courses/course-v1:UTAustinX+UT.7.21x+3T2016/course/ and it seems to work very well. Unfortunately the output page contains extraneous HTML, but the course content is saved and readable. Important course notes are saved now that I would not otherwise have seen.

See the PDF below for what the HTML output looks like. Foundations_of_DataAnalysis-_Part_1 - Week 1: Introduction to Data>.pdf

Warning. The "page dump" to the local HTML file is not filtered when saving and so it will contain HTML tags and/or bits of HTML script which have identifying information of the original edx user account (like user name, user id). So the result is not 'private'.
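
A rough sketch of one partial mitigation, assuming much of the account data sits in inline `<script>` blocks (not verified here): drop those blocks from the page text, after it has been unescaped, before writing it out. `strip_scripts` is a hypothetical helper, not part of edx-dl, and a regular expression is a blunt tool for HTML, so identifying data elsewhere in the markup would remain.

```python
import re

# Matches a whole <script ...>...</script> element, across lines, in any case.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE)


def strip_scripts(page_text):
    """Remove <script> elements from raw page text (hypothetical helper)."""
    return SCRIPT_RE.sub("", page_text)


sample = '<p>Notes</p><script>var user = "someone";</script><p>More</p>'
cleaned = strip_scripts(sample)
print(cleaned)
```

Even with this filter applied, the saved pages should still be treated as not private.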

Implementation Detail

  1. Add 'doc' to the Unit object.
  2. When extracting units during parsing, save the 'text' to Unit.doc.
  3. In main, before filtering out duplicate URLs, loop through the course's Units and write the text from Unit.doc to one HTML file per section at the top level. Each "weekly lesson" directory therefore has a corresponding HTML page containing the notes for its Units.
  4. Add some custom HTML markup to the HTML file, such as a page title, a header for each Unit, and a list of the resources for easy cross-referencing.

Patch

This patch will need cleanup and refactoring. But it works. To apply the patch add or modify the stuff labelled 'jcline'. Sorry no real patch diff.

edx_dl.py

import re               
import sys              
from six.moves import html_parser  # jcline

....

def main():

....

    parse_units(selections)

    if args.cache:
        write_units_to_cache(all_units)

    # jcline 
    # -- Write each edx html page content to Notes-[SECTION NAME].html
    for selected_course, selected_sections in selections.items():
        coursename = directory_name(selected_course.name)
        for selected_section in selected_sections:
            section_dirname = "%02d-%s" % (selected_section.position,
                                           selected_section.name)
            target_dir = os.path.join(args.output_dir, coursename)
            mkdir_p(target_dir)
            counter = 0
            filename = os.path.join(target_dir, "Notes-%s.html" % clean_filename(section_dirname))
            logging.info("Writing Course Section document to %s", filename)
            with open(filename, 'w') as f:
                markup = "<!DOCTYPE html>\n<html><head>\n"
                markup += "<title>%s - %s</title>\n" % (coursename, selected_section.name)
                markup += "</head>\n"
                markup += "<body><h1>%s</h1><h1>%s</h1>\n" % (coursename, selected_section.name)
                f.write(markup)
                for subsection in selected_section.subsections:
                    units = all_units.get(subsection.url, [])
                    for unit in units:
                        counter += 1
                        markup = "\n<h2> Unit %d </h2>\n" % counter
                        f.write(markup)
                        f.write(html_parser.HTMLParser().unescape(unit.doc))
                        markup = "\n<h2> Media </h2>\n<ul>\n"
                        for video in unit.videos:
                            markup += "<li>" + " ".join(video.mp4_urls) + "\n"
                        for item in unit.resources_urls:
                            markup += "<li>" + item + "\n"
                        markup += "\n</ul>\n"
                        f.write(markup)
                markup = "\n\n<p><footer>Created by edx_dl %s </footer>\n<!-- __END__ -->\n</body>\n</html>\n" % __version__
                f.write(markup)
    # ---  jcline

    # This removes all repeated important urls
    # FIXME: This is not the best way to do it but it is the simplest, a
    # better approach will be to create symbolic or hard links for the repeated
    # units to avoid losing information
    filtered_units = remove_repeated_urls(all_units)
    num_all_urls = num_urls_in_units_dict(all_units)

parsing.py

class CurrentEdXPageExtractor(ClassicEdXPageExtractor):
...
    def extract_unit(self, text, BASE_URL, file_formats):

....
        resources_urls = self.extract_resources_urls(text, BASE_URL,
                                                     file_formats)
        return Unit(videos=videos, resources_urls=resources_urls, doc=text) # jcline

common.py

class Unit(object):

....

        self.videos = videos
        self.resources_urls = resources_urls
        self.doc = doc                          # jcline

jcline-ieee commented 6 years ago

Follow-up note on improving the above patch. At least one course had a large Section with no videos, only long reference texts, and one course had a large 'Resources' set of document pages linked as the final week of the course. These sections were not written to the new Notes-[Section title].html files. Perhaps Units with no videos or resources_urls are being discarded internally. The internal logic should be improved to keep Units that have no videos but do contain instructor text, so that the text can be saved locally to the Notes file.
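
One way to express the rule being asked for, as a sketch only: `unit_has_content` is a hypothetical predicate and the `Unit` below is a stand-in with just the fields used in the patch. Where edx-dl actually drops such units is not confirmed in this thread.

```python
from collections import namedtuple

# Stand-in for edx-dl's Unit, carrying only the fields used in the patch.
Unit = namedtuple("Unit", ["videos", "resources_urls", "doc"])


def unit_has_content(unit):
    """Keep a unit if it has videos, resources, or non-blank page text."""
    has_text = bool(unit.doc and unit.doc.strip())
    return bool(unit.videos) or bool(unit.resources_urls) or has_text


units = [
    Unit(videos=[], resources_urls=[], doc="<p>Long reference text</p>"),
    Unit(videos=[], resources_urls=[], doc=None),
]
kept = [u for u in units if unit_has_content(u)]
print(len(kept))
```

With a check like this, a text-only unit survives while a unit with no videos, resources, or text is still skipped.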

rbrito commented 6 years ago

@jcline-ieee, please start editing the appropriate files.

GitHub will automatically create a fork of the project for you (especially handy if you're not familiar with all the mechanics of git). GitHub will hide the complexity of using git and make it easier for you to send your changes and for us to evaluate and integrate them.

TheNameIsChaitanya commented 5 years ago

When I apply the patch posted by @jcline-ieee (on v0.1.10 of edx-dl) and run the program on a MOOC, I get the error:

<MYCondaEnv>\lib\site-packages\edx_dl\edx_dl.py", line 1078, in main
    f.write(html_parser.HTMLParser().unescape(unit.doc))
AttributeError: 'Unit' object has no attribute 'doc'

And just before this error, the log message was: Writing Course Section document to <SomeFolder>\Notes-01-Welcome.html

nikojpapa commented 5 years ago

When I apply the patch posted by @jcline-ieee (on v0.1.10 of edx-dl) and run the program on a MOOC, I get the error:

<MYCondaEnv>\lib\site-packages\edx_dl\edx_dl.py", line 1078, in main
    f.write(html_parser.HTMLParser().unescape(unit.doc))
AttributeError: 'Unit' object has no attribute 'doc'

And just before this error, the log message was: Writing Course Section document to <SomeFolder>\Notes-01-Welcome.html

To solve this issue, modifying these two additional lines should work.

In common.py, define the constructor of class Unit(object) like so: def __init__(self, videos, resources_urls, doc=None):

And, in the extract_unit function in parsing.py, change the return line to: return Unit(videos=videos, resources_urls=resources_urls, doc=text)
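
Put together, the two changes might look like the sketch below. It is a simplified stand-in, not the actual edx-dl source: the real `Unit` class and `extract_unit` method contain more logic, and the one-argument `extract_unit` here is hypothetical. The `doc=None` default is what keeps call sites that pass only `videos` and `resources_urls` working.

```python
class Unit(object):
    """Simplified stand-in for the Unit class in common.py."""

    def __init__(self, videos, resources_urls, doc=None):
        self.videos = videos
        self.resources_urls = resources_urls
        # Raw page text of the unit; None when the caller did not supply it.
        self.doc = doc


def extract_unit(text):
    """Hypothetical, stripped-down version of extract_unit in parsing.py."""
    videos = []          # the real method extracts these from `text`
    resources_urls = []  # likewise
    return Unit(videos=videos, resources_urls=resources_urls, doc=text)


unit = extract_unit("<p>Instructor notes</p>")
legacy = Unit(videos=[], resources_urls=[])
print(unit.doc, legacy.doc)
```

Note that a `Unit` created without `doc` still ends up with `doc` set to `None`, so any code that writes `unit.doc` has to be prepared for that value.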

Dinesh6777 commented 5 years ago

I'm getting the error below. I also tried the steps mentioned by @nikojpapa.

Writing Course Section document to Downloaded\Programming_with_C\Notes-01-Module_0.html
Traceback (most recent call last):
  File "C:\Users\dinesh\Desktop\edx-dl\edx-dl.py", line 6, in <module>
    edx_dl.main()
  File "C:\Users\dinesh\Desktop\edx-dl\edx_dl\edx_dl.py", line 1077, in main
    f.write(html_parsher.HTMLParser().unescape(unit.doc))
  File "C:\Users\dinesh\AppData\Local\Programs\Python\Python37-32\lib\html\parser.py", line 470, in unescape
    return unescape(s)
  File "C:\Users\dinesh\AppData\Local\Programs\Python\Python37-32\lib\html\__init__.py", line 130, in unescape
    if '&' not in s:
TypeError: argument of type 'NoneType' is not iterable
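
The `TypeError` suggests that `unit.doc` was `None` for at least one unit. That would happen if a `Unit` was created by a code path that does not pass `doc`, though this is an assumption: the traceback alone does not show which path. A minimal sketch of a guard follows. `unit_doc_as_html` is a hypothetical helper, not part of edx-dl. It uses the standard-library `html.unescape`, since `HTMLParser.unescape` was removed in Python 3.9.

```python
import html


def unit_doc_as_html(doc):
    """Return unescaped unit text, or an empty string when no text was saved.

    Hypothetical helper: `doc` stands for the `unit.doc` value from the patch.
    """
    if doc is None:
        return ""
    return html.unescape(doc)


print(unit_doc_as_html("&lt;p&gt;Week 1 notes&lt;/p&gt;"))
print(repr(unit_doc_as_html(None)))
```

In the patch, the write call would then become `f.write(unit_doc_as_html(unit.doc))`.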