Chillee / coursera-dl-all

MIT License

algo-004 and algo2-003 improperly resolve (and miscellaneous other issues) #3

Closed joshuawn closed 7 years ago

joshuawn commented 8 years ago

The lecture videos can be found & are properly scraped, but the courses themselves resolve to a 404 error page. In other words, the deprecation process has already begun for some of these courses even though the videos remain intact.

One option is to add checks for specific reported courses so that only the videos are scraped when the course directory pages are down. However, since the deprecation process is ongoing, it is probably better to handle the Selenium WebDriver exceptions more gracefully so that one course (or even one page of a course) can't crash the entire program. Maybe output all exceptions to a log file and, before the script ends, alert the user through the terminal that errors were logged in output file LOG_FILE_NAME.
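
A minimal sketch of that log-and-continue idea, assuming a hypothetical per-course callable `scrape_course` and a placeholder value for LOG_FILE_NAME; neither comes from dl_all.py:

```python
import logging

LOG_FILE_NAME = "dl_all_errors.log"  # placeholder name, not from dl_all.py


def run_all(courses, scrape_course, log_file=LOG_FILE_NAME):
    """Scrape every course, logging failures instead of crashing.

    `scrape_course` is a hypothetical callable that handles one course
    and may raise (for example a Selenium WebDriverException).
    Returns the list of courses that failed.
    """
    logger = logging.getLogger("dl_all")
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    failed = []
    try:
        for course in courses:
            try:
                scrape_course(course)
            except Exception:
                # logger.exception writes the message plus the full traceback.
                logger.exception("Failed to scrape %s", course)
                failed.append(course)
    finally:
        logger.removeHandler(handler)
        handler.close()
    if failed:
        # Tell the user where to look before the script ends.
        print("%d course(s) had errors; see %s" % (len(failed), log_file))
    return failed
```

Catching `Exception` per course is deliberately broad here: the goal is only to keep the remaining courses downloading while the traceback is preserved in the log.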

Also, some course videos won't get parsed until you're fully enrolled (algs4partII-007). Since your script now supports automatic enrollment, videos should be handled after the quizzes and assignments. Running coursera-dl with the --clear-cache argument also helps when the script is re-run using different Coursera accounts.

Fully transitioning to Selenium requires custom capabilities for Firefox. Right now, the Marionette web driver isn't bundled with Firefox, so the driver executable has to be added to PATH explicitly. Including a link to this might help new users install your script: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver

Since Firefox's capabilities need to be explicitly sent, using session.quit() is more reliable than session.close(): close() only closes the current window, while quit() ends the whole driver session.

Also, there's a race condition of some sort when quizzes are captured. Here's a stack traceback I received while parsing algs4partII-007:

Traceback (most recent call last):
  File "dl_all.py", line 301, in <module>
    download_all_quizzes(session, quiz_info, i[1])
  File "dl_all.py", line 190, in download_all_quizzes
    download_quiz(session, quiz_obj, category_name)
  File "dl_all.py", line 183, in download_quiz
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 104, in download_all_zips_on_page
    url = i.get_attribute('href')
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 111, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 456, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed.

Waiting on attributes is unsatisfactory, so you may need to rethink how to determine whether a link is fully loaded. Possible solutions:
https://blog.mozilla.org/webqa/2012/07/12/how-to-webdriverwait/
https://github.com/angular/protractor/issues/610
http://stackoverflow.com/questions/5709204/random-element-is-no-longer-attached-to-the-dom-staleelementreferenceexception
https://media.readthedocs.org/pdf/marionette_client/latest/marionette_client.pdf (useful if Marionette-enabled Firefox is used)
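
One way to approach the stale-element problem is to retry the whole "find the links, then read their attributes" step, so each attempt works with freshly located elements. The helper below is a rough sketch, not code from dl_all.py; `retry_on` and its parameters are hypothetical names, and the Selenium usage is shown only as a comment because it needs a live browser session:

```python
import time


def retry_on(exc_types, action, attempts=3, delay=1.0):
    """Call `action()` until it succeeds, retrying on `exc_types`.

    Re-raises the last exception once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return action()
        except exc_types:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


# Intended use with Selenium (assumption: `session` is a live WebDriver):
#
#   from selenium.common.exceptions import StaleElementReferenceException
#   hrefs = retry_on(
#       StaleElementReferenceException,
#       lambda: [a.get_attribute('href')
#                for a in session.find_elements_by_css_selector('a')])
```

The important detail is that the element lookup happens inside the retried action; retrying only `get_attribute` on an already stale reference would fail again.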

Here are some of the revisions I made to handle two of the aforementioned issues:

# Include the following import
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

for i in reader:

    class_url, class_slug = get_class_url_info(i)
    print(class_url, class_slug)
    mkdir_safe(class_slug)
    os.chdir(class_slug)

    # session = dryscrape.Session()
    if args.headless:
        session = webdriver.PhantomJS()
    else:
        # Copy the defaults so the shared DesiredCapabilities.FIREFOX dict isn't mutated.
        firefox_capabilities = DesiredCapabilities.FIREFOX.copy()
        firefox_capabilities['marionette'] = True
        firefox_capabilities['binary'] = '/usr/bin/firefox'      # binary path could be handled better to support multi-platform portability.
        session = webdriver.Firefox(capabilities=firefox_capabilities)
    print("Logging In....")
    error = login(session, class_url, args.u, args.p )
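    if error == -1:
        session.quit()      # quit() rather than close(), as noted above
        os.chdir('..')      # leave the course directory before moving on
        continue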
    print("Logged in!")
    # if

    if not args.ns:
        download_sidebar_pages(session)

    if (args.q):
        # quiz_info = get_quiz_info(session)
        print("Downloading Quizzes....")
        quiz_links = get_quiz_types(session)
        for i in quiz_links:
            print("Downloading "+i[1])
            quiz_info = get_quiz_info(session, i[0], i[1])
            download_all_quizzes(session, quiz_info, i[1])
    # print(class_url)
    if (args.a):
        mkdir_safe("assignments")
        assign_info = get_assign_info(session)
        download_all_assignments(session, assign_info)

    session.quit()

    os.chdir('..')
    if (args.v):
        os.system('coursera-dl --clear-cache -u '+args.u+' -p '+args.p+' --path='+os.getcwd()+' '+class_slug)
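
The os.system call above builds a shell string by concatenation, so a password or path containing spaces or shell metacharacters would break it. A possible alternative, sketched under the assumption that the same coursera-dl flags are wanted, passes the arguments as a list; the two function names are hypothetical:

```python
import os
import subprocess


def build_coursera_dl_command(username, password, class_slug, path=None):
    """Return the coursera-dl invocation as an argument list."""
    if path is None:
        path = os.getcwd()
    return ['coursera-dl', '--clear-cache',
            '-u', username, '-p', password,
            '--path=' + path, class_slug]


def run_coursera_dl(username, password, class_slug, path=None):
    # With a list and no shell, spaces or shell metacharacters in the
    # password or path are passed through verbatim.
    return subprocess.call(
        build_coursera_dl_command(username, password, class_slug, path))
```

In the loop this would replace the os.system line with something like run_coursera_dl(args.u, args.p, class_slug).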

joshuawn commented 8 years ago

Also, I believe the honor code acceptance section of the login procedure skips the three-radio-button survey at the top of the page. I'm a complete novice when it comes to Selenium & parsing websites, so I hacked together the following fix by iterating through all radio buttons and using an implicit wait:

def login(session, URL, email, password):   #ugly ugly code in here
    session.get(URL)
    # print(session.find_elements_by_css_selector('#user-modal-email')))
    WebDriverWait(session, 30).until(
        lambda session: len(session.find_elements_by_css_selector('#user-modal-email'))>2)

    x = session.find_elements_by_css_selector('#user-modal-email')[1]
    x.send_keys(email)
    x = session.find_elements_by_css_selector('#user-modal-password')[1]
    x.send_keys(password)
    # print(os.getcwd())
    render(session, os.getcwd()+'/entered_login')
    session.find_elements_by_css_selector('form > button')[1].click()

    WebDriverWait(session, 30).until(
        lambda session: len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1 or
                        len(session.find_elements_by_css_selector('.c-coursePage-sidebar-enroll-button'))>=1 or
                        len(session.find_elements_by_css_selector('#agreehonorcode'))>=1 or
                        session.page_source.find('Remove from watchlist')!=-1)

    if len(session.find_elements_by_css_selector('.c-coursePage-sidebar-enroll-button')) >=1:
        session.find_elements_by_css_selector('.c-coursePage-sidebar-enroll-button')[0].click()  #enroll button
        WebDriverWait(session, 10).until(lambda session:  len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1 or
                                                          len(session.find_elements_by_css_selector('.fullbleed'))>=1 or
                                                          session.page_source.find('we will notify you by email when it starts')!=-1 or
                                                          session.page_source.find('ll email you if there are new session dates')!=-1)
        if len(session.find_elements_by_css_selector('.fullbleed'))>=1 and session.find_elements_by_css_selector('.fullbleed')[0].text.find('Learn more')==-1:
            session.implicitly_wait(10)
            session.find_elements_by_css_selector('.fullbleed')[0].click() #go to course button
            WebDriverWait(session, 10).until(lambda session: len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1 or
                                                         len(session.find_elements_by_css_selector('#agreehonorcode'))>=1)
            if len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1:
                pass
            elif len(session.find_elements_by_css_selector('#agreehonorcode'))>=1:
                for i in session.find_elements_by_xpath("//*[@type='radio']"):
                    i.click()
                session.find_elements_by_css_selector('#agreehonorcode')[0].click()
                wait_for_load(session)
        elif len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1:
            pass
        else:
            print("Error: Impossible to access course "+URL)
            return -1
    elif len(session.find_elements_by_css_selector('#agreehonorcode'))>=1:
        session.implicitly_wait(10)
        for i in session.find_elements_by_xpath("//*[@type='radio']"):
            i.click()
        session.find_elements_by_css_selector('#agreehonorcode')[0].click()
        wait_for_load(session)
    elif session.page_source.find('Remove from watchlist')!=-1:
        print("Error: Impossible to access course "+URL)
        return -1

    render(session, os.getcwd()+'/course_home')
    return 0

I'm sure you can handle this much more elegantly than I can.

Also, thank you for addressing these issues so quickly even during Father's Day. Hope we can download all these courses before they're gone for good.

joshuawn commented 8 years ago

Stale element caused by Problem 7.4 in intrologic-005 due to dynamic content:

Traceback (most recent call last):
  File "dl_all.py", line 313, in <module>
    download_all_quizzes(session, quiz_info, i[1])
  File "dl_all.py", line 199, in download_all_quizzes
    download_quiz(session, quiz_obj, clean_filename(category_name))
  File "dl_all.py", line 192, in download_quiz
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 111, in download_all_zips_on_page
    links = [i.get_attribute('href') for i in links]
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 111, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 456, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed.

Worked around by forcing download_all_zips_on_page to sleep for 5 seconds before executing:

def download_all_zips_on_page(session, path='assignments'):
    # Crude workaround: give dynamic content time to settle so the link
    # elements are not stale by the time their attributes are read.
    time.sleep(5)
    links = session.find_elements_by_css_selector('a')

    if not os.path.exists(path):
        os.makedirs(path)
    links = [i.get_attribute('href') for i in links]
    hw_strings = ['.zip', '.py', '.m', '.pdf', '.txt']

    # The with block closes links.txt even if a download raises.
    with open(path+'/links.txt', 'w') as txt_file:
        for url in links:
            if url is None:
                continue
            url_ex = os.path.splitext(url)[1]
            txt_file.write(url+'\n')
            is_hw = False
            for j in hw_strings:
                if url_ex.find(j)!=-1:
                    is_hw = True
                    break

            if is_hw:
                # print(url)
                if url in downloaded_links:
                    continue
                else:
                    downloaded_links.add(url)
                try:
                    if sys.version_info >= (3, 0):
                        urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
                    else:
                        urllib.urlretrieve(url, path+url[url.rfind('/'):])
                except IOError:  # covers urllib.error.HTTPError on Python 3; urllib.error doesn't exist on Python 2
                    print("Failed to download "+url)
                    continue
                render(session, os.getcwd()+'/'+path+'/zip_page')
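
A side note on the extension check in that function: it is a substring match on whatever follows the last dot, so an extension like '.mp4' also matches '.m'. If exact matching is preferred, something along these lines might work. This is only a sketch; `is_homework_url` and `HW_EXTENSIONS` are hypothetical names, and lowercasing the extension is an added assumption rather than existing behavior:

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

HW_EXTENSIONS = ('.zip', '.py', '.m', '.pdf', '.txt')


def is_homework_url(url, extensions=HW_EXTENSIONS):
    """Return True if the URL path ends in one of `extensions`."""
    if not url:
        return False
    # urlparse drops the query string, so 'a.zip?token=1' still yields '.zip'.
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in extensions
```

With this, the inner loop over hw_strings could shrink to a single call such as is_hw = is_homework_url(url).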

Chillee commented 7 years ago

This project has been deprecated for a while. Just doing some cleaning.