Closed joshuawn closed 7 years ago
Also, I believe the honor code acceptance section of the login procedure skips the three-radio-button survey at the top of the page. I'm a complete novice when it comes to Selenium & parsing websites, so I hacked together the following fix, which iterates through all radio buttons and uses an implicit wait:
    def login(session, URL, email, password): #ugly ugly code in here
        session.get(URL)
        # print(session.find_elements_by_css_selector('#user-modal-email')))
        WebDriverWait(session, 30).until(
            lambda session: len(session.find_elements_by_css_selector('#user-modal-email'))>2)
        x = session.find_elements_by_css_selector('#user-modal-email')[1]
        x.send_keys(email)
        x = session.find_elements_by_css_selector('#user-modal-password')[1]
        x.send_keys(password)
        # print(os.getcwd())
        render(session, os.getcwd()+'/entered_login')
        session.find_elements_by_css_selector('form > button')[1].click()
        WebDriverWait(session, 30).until(
            lambda session: len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1 or
            len(session.find_elements_by_css_selector('.c-coursePage-sidebar-enroll-button'))>=1 or
            len(session.find_elements_by_css_selector('#agreehonorcode'))>=1 or
            session.page_source.find('Remove from watchlist')!=-1)
        if len(session.find_elements_by_css_selector('.c-coursePage-sidebar-enroll-button')) >=1:
            session.find_elements_by_css_selector('.c-coursePage-sidebar-enroll-button')[0].click() #enroll button
            WebDriverWait(session, 10).until(lambda session: len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1 or
                len(session.find_elements_by_css_selector('.fullbleed'))>=1 or
                session.page_source.find('we will notify you by email when it starts')!=-1 or
                session.page_source.find('ll email you if there are new session dates')!=-1)
            if len(session.find_elements_by_css_selector('.fullbleed'))>=1 and session.find_elements_by_css_selector('.fullbleed')[0].text.find('Learn more')==-1:
                session.implicitly_wait(10)
                session.find_elements_by_css_selector('.fullbleed')[0].click() #go to course button
                WebDriverWait(session, 10).until(lambda session: len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1 or
                    len(session.find_elements_by_css_selector('#agreehonorcode'))>=1)
                if len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1:
                    pass
                elif len(session.find_elements_by_css_selector('#agreehonorcode'))>=1:
                    for i in session.find_elements_by_xpath("//*[@type='radio']"):
                        i.click()
                    session.find_elements_by_css_selector('#agreehonorcode')[0].click()
                    wait_for_load(session)
            elif len(session.find_elements_by_css_selector(SIDEBAR_LOAD_URL)) >=1:
                pass
            else:
                print("Error: Impossible to access course"+URL)
                return -1
        elif len(session.find_elements_by_css_selector('#agreehonorcode'))>=1:
            session.implicitly_wait(10)
            for i in session.find_elements_by_xpath("//*[@type='radio']"):
                i.click()
            session.find_elements_by_css_selector('#agreehonorcode')[0].click()
            wait_for_load(session)
        elif session.page_source.find('Remove from watchlist')!=-1:
            print("Error: Impossible to access course"+URL)
            return -1
        render(session, os.getcwd()+'/course_home')
        return 0
I'm sure you can handle this much more elegantly than I can.
Also, thank you for addressing these issues so quickly even during Father's Day. Hope we can download all these courses before they're gone for good.
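The radio-button handling in the login function appears in two branches. A minimal sketch of pulling it into one helper, assuming the same selectors as the snippet and the older `find_elements_by_*` Selenium API it relies on. `accept_honor_code` is a hypothetical name, and the helper only needs an object that exposes those two methods:

```python
def accept_honor_code(session):
    """Click every radio button, then the honor-code element.

    Returns True if '#agreehonorcode' was found and clicked,
    False if the page has no such element.
    Hypothetical helper; selectors are copied from the login snippet.
    """
    agree = session.find_elements_by_css_selector('#agreehonorcode')
    if not agree:
        return False
    # Answer the survey by clicking each radio button in turn, as the
    # original loop does (the last button of each group stays selected).
    for radio in session.find_elements_by_xpath("//*[@type='radio']"):
        radio.click()
    agree[0].click()
    return True
```

Both honor-code branches could then call `accept_honor_code(session)` followed by `wait_for_load(session)`.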
Stale element caused by Problem 7.4 in intrologic-005 due to dynamic content:
Traceback (most recent call last):
  File "dl_all.py", line 313, in <module>
    download_all_quizzes(session, quiz_info, i[1])
  File "dl_all.py", line 199, in download_all_quizzes
    download_quiz(session, quiz_obj, clean_filename(category_name))
  File "dl_all.py", line 192, in download_quiz
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 111, in download_all_zips_on_page
    links = [i.get_attribute('href') for i in links]
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 111, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 456, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed.
Worked around by forcing download_all_zips_on_page to sleep for 5 seconds before executing:
    def download_all_zips_on_page(session, path='assignments'):
        # Workaround: give dynamic content time to settle before reading links.
        time.sleep(5)
        links = session.find_elements_by_css_selector('a')
        if not os.path.exists(path):
            os.makedirs(path)
        txt_file = open(path+'/links.txt', 'w')
        links = [i.get_attribute('href') for i in links]
        for i in links:
            url = i
            if url is None:
                continue
            url_ex = os.path.splitext(url)[1]
            txt_file.write(url+'\n')
            hw_strings = ['.zip', '.py', '.m', '.pdf', '.txt']
            is_hw = False
            for j in hw_strings:
                if url_ex.find(j)!=-1:
                    is_hw = True
                    break  # one matching extension is enough
            if is_hw:
                # print(url)
                if url in downloaded_links:
                    continue
                else:
                    downloaded_links.add(url)
                try:
                    if sys.version_info >= (3, 0):
                        urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
                    else:
                        urllib.urlretrieve(url, path+url[url.rfind('/'):])
                except urllib.error.HTTPError:
                    print("Failed to download "+url)
                    continue
        render(session, os.getcwd()+'/'+path+'/zip_page')
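A fixed `time.sleep(5)` can still lose the race on a slow page. One alternative, sketched under assumptions, is to re-run the whole find-and-read step when it fails, so stale references are thrown away and fresh ones are fetched. `retry_call` and `read_hrefs` are hypothetical helpers; with Selenium, `retry_on` would be `selenium.common.exceptions.StaleElementReferenceException`, the exception shown in the traceback.

```python
import time

def retry_call(func, retry_on, attempts=5, delay=1.0):
    """Call func() until it succeeds or the attempts run out.

    Only exceptions of type retry_on trigger a retry; the last
    failure is re-raised so the caller still sees the real error.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

def read_hrefs(session):
    # Find the anchors and read their hrefs in a single step, so that
    # each retry starts from fresh element references.
    return [a.get_attribute('href')
            for a in session.find_elements_by_css_selector('a')]
```

Usage would look something like `links = retry_call(lambda: read_hrefs(session), StaleElementReferenceException)`.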
This project has been deprecated for a while. Just doing some cleaning.
The lecture videos can be found & are properly scraped, but the courses themselves resolve to a 404 error page. In other words, the deprecation process has already begun for some of these courses even though the videos remain intact.
It might be necessary to add checks so that, for the courses reported as broken, only the videos are scraped when the course directory pages are down. However, since the deprecation process is ongoing, it is probably simpler to handle the Selenium WebDriver exceptions more gracefully, so that one course (or even one page of a course) cannot crash the entire program. Maybe write all exceptions to a log file and, before the script ends, alert the user in the terminal that errors were logged to the output file LOG_FILE_NAME.
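A minimal sketch of that idea using the standard `logging` module: each course runs inside its own `try`, failures are written to a log file with their traceback, and a summary is printed at the end. `scrape_all`, `scrape_course`, and the default log file name are hypothetical; a real version would probably catch `selenium.common.exceptions.WebDriverException` rather than a bare `Exception`.

```python
import logging

def scrape_all(courses, scrape_course, log_file='dl_all_errors.log'):
    """Run scrape_course(course) for each course; return the failed ones."""
    logger = logging.getLogger('dl_all')
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    logger.propagate = False  # keep tracebacks out of the terminal
    failed = []
    try:
        for course in courses:
            try:
                scrape_course(course)
            except Exception:
                # logger.exception records the full traceback in the file.
                logger.exception('Failed to scrape %s', course)
                failed.append(course)
    finally:
        logger.removeHandler(handler)
        handler.close()
    if failed:
        print('%d course(s) had errors; see %s' % (len(failed), log_file))
    return failed
```

The returned list also makes it easy to retry only the failed courses on a later run.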
Also, some course videos won't get parsed until you're fully enrolled (algs4partII-007). Since your script now supports automatic enrollment, videos should be handled after the quizzes and assignments. Running coursera-dl with the --clear-cache argument also helps when the script is re-run using different Coursera accounts.
Fully transitioning to Selenium requires custom capabilities for Firefox. Right now, the Marionette web driver isn't bundled with Firefox, and its location must be added to PATH explicitly. Including a link to this might help new users install your script: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver
Since Firefox's capabilities need to be explicitly sent, using session.quit() is more reliable than session.close().
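One way to make sure `quit()` always runs, even when scraping raises, is a small context manager. This is a sketch, assuming `make_driver` is any zero-argument factory that returns a WebDriver (for example `webdriver.Firefox`); `managed_session` is a hypothetical name.

```python
from contextlib import contextmanager

@contextmanager
def managed_session(make_driver):
    """Yield a driver and always call quit() on exit."""
    session = make_driver()
    try:
        yield session
    finally:
        # quit() ends the whole WebDriver session; close() only closes
        # the current window.
        session.quit()
```

The main script could then wrap its work in `with managed_session(webdriver.Firefox) as session:`.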
Also, there's a race condition of some sort when quizzes are captured. Here's a stack traceback I received while parsing algs4partII-007:
Waiting on attributes is unsatisfactory, so you may need to rethink how to determine whether a link is fully loaded. Possible solutions: https://blog.mozilla.org/webqa/2012/07/12/how-to-webdriverwait/ https://github.com/angular/protractor/issues/610 http://stackoverflow.com/questions/5709204/random-element-is-no-longer-attached-to-the-dom-staleelementreferenceexception https://media.readthedocs.org/pdf/marionette_client/latest/marionette_client.pdf (useful if Marionette-enabled Firefox is used)
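One heuristic for dynamic pages, offered as an assumption rather than something taken from the links above: poll the page and treat it as loaded once two consecutive polls return the same result. `wait_until_stable` is a hypothetical helper, and `read` is any zero-argument callable, for example one that re-finds the anchors and returns their `href` values.

```python
import time

def wait_until_stable(read, timeout=30.0, poll=0.5):
    """Poll read() until two consecutive results are equal.

    Returns the stable value, or raises RuntimeError on timeout.
    """
    deadline = time.time() + timeout
    previous = read()
    while time.time() < deadline:
        time.sleep(poll)
        current = read()
        if current == previous:
            return current
        previous = current
    raise RuntimeError('page did not settle within %.1f seconds' % timeout)
```

This trades a fixed sleep for a bounded wait. It can still be fooled by content that pauses longer than `poll` between updates, so the interval would need tuning per site.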
Here are some of the revisions I made to handle two of the aforementioned issues: