dgorissen / coursera-dl

A script for downloading course material (videos, PDFs, quizzes, etc.) from coursera.org
http://dirkgorissen.com/2012/09/07/coursera-dl-a-coursera-download-script/
GNU General Public License v3.0

Fails to download videos from course "compfinance-2012-001" #5

Closed darindillon closed 11 years ago

darindillon commented 11 years ago

The good news is that it correctly downloads the resources, but it fails to download the videos themselves from "compfinance-2012-001". Log (with command-line args):

BigMac:~ darin$ cd coursera/
BigMac:coursera darin$ coursera-dl -u <...> -p <...> -d . compfinance-2012-001
Authenticating as <...>...
Collecting downloadable content from http://class.coursera.org/compfinance-2012-001/lecture/index
Warning: Failed to find video for Welcome to Introduction to Computational Finance and Financial Econometrics (1314)
Warning: Failed to find video for 1.0 Week 1 Introduction (058)
Warning: Failed to find video for 1.1 Future Value, Present Value and Compounding (1702)
Warning: Failed to find video for 1.2 Asset Returns (1653)
Warning: Failed to find video for 1.3 Portfolio Returns (912)
Warning: Failed to find video for 1.4 Dividends (400)
Warning: Failed to find video for 1.5 Inflation (457)
Warning: Failed to find video for 1.6 Annualizing Returns (532)
Warning: Failed to find video for 1.7 Continuously Compounded Returns (1555)
Warning: Failed to find video for 1.8 CC Portfolio Returns and Inflation (550)
Warning: Failed to find video for 1.9 Simple Returns (401)
Warning: Failed to find video for 1.10 Getting Financial Data from Yahoo (1026)
Warning: Failed to find video for 1.11 Return Calculations (621)
Warning: Failed to find video for 1.12 Growth of 1 (658)
Warning: Failed to find video for 2.0 Week 2 Introduction (106)
Warning: Failed to find video for 2.1 Univariate Random Variables (2011)
Warning: Failed to find video for 2.2 Cumulative Distribution Function (842)
Warning: Failed to find video for 2.3 Quantiles (750)
Warning: Failed to find video for 2.4 Standard Normal Distribution (1602)
Warning: Failed to find video for 2.5 Expected Value and Standard Deviation (1958)
Warning: Failed to find video for 2.6 General Normal Distribution (623)
Warning: Failed to find video for 2.7 Standard Deviation as a Measure of Risk (434)
Warning: Failed to find video for 2.8 Normal Distribution Appropriate for simple returns (1422)
Warning: Failed to find video for 2.9 Skewness and Kurtosis (1539)
Warning: Failed to find video for 2.10 Students-t Distribution (552)
Warning: Failed to find video for 2.11 Linear Functions of Random Variables (1113)
Warning: Failed to find video for 2.12 Value at Risk (1948)
Warning: Failed to find video for 3.0 Week 3 Introduction (104)
Warning: Failed to find video for 3.1 Location-scale Model (1215)
Warning: Failed to find video for 3.2 Bivariate Discrete Distributions (1418)
Warning: Failed to find video for 3.3 Bivariate Continuous Distributions (1415)
Warning: Failed to find video for 3.4 Covariance (1916)
Warning: Failed to find video for 3.5 Correlation and the Bivariate Normal Distribution (1159)
Warning: Failed to find video for 3.6 Linear Combination of 2 Random Variables (1109)
Warning: Failed to find video for 3.7 Portfolio Example (1920)
Warning: Failed to find video for 3.8 Matrix Algebra Review Part 1 (1702)
Warning: Failed to find video for 3.9 Matrix Algebra Review Part 2 (2010)

(And so on: there are many more "Failed to find video" warnings here. After that it starts downloading the resources, and all of those download fine. It's only the videos that fail.)

olegafx commented 11 years ago

I tried to download this course and everything was downloaded, including the videos.

dgorissen commented 11 years ago

Just tried and it works perfectly here. Could you make sure you are using the latest version (from pip or GitHub, they should be the same) and retry? If you still have issues, could you give some background on your setup (Python version, OS, etc.)?

darindillon commented 11 years ago

Unfortunately, it's still not working for me. I do have the latest version of coursera-dl, installed via pip:

BigMac:site-packages darin$ sudo pip install coursera-dl --upgrade
Requirement already up-to-date: coursera-dl in /Library/Python/2.7/site-packages
Requirement already up-to-date: mechanize in /Library/Python/2.7/site-packages (from coursera-dl)
Requirement already up-to-date: beautifulsoup4 in /Library/Python/2.7/site-packages (from coursera-dl)
Requirement already up-to-date: argparse in /Library/Python/2.7/site-packages (from coursera-dl)
Cleaning up...

But I'm still getting the errors listed in the original issue above. For what it's worth, I'm on a Mac (version 10.8.2). The Python version is 2.7.2.

Looking at the script, it appears BeautifulSoup is the culprit. The following line is returning None:

vobj = bb.find('source',type="video/mp4")

But the pip output above says I do have beautifulsoup4 installed. I printed out the URL that the script is trying to load, and that URL loads correctly in the browser. I looked at the source of that page, and it does indeed have a "source" tag with "video/mp4", so everything appears right. Is there some way I can modify the script to trap whatever error BeautifulSoup is apparently encountering?

darindillon commented 11 years ago

More information: coursera is returning a different page when run by the script than when run in the browser. I wonder if they're detecting the user-agent of my mac and editing content based on that? Here's the URL the script is hitting: https://class.coursera.org/compfinance-2012-001/lecture/view?lecture_id=31 And when I hit that in my browser, everything is fine. Viewing the source shows exactly what your script expects. HOWEVER, when I run your script, they give me a different page. Here's the entire body (all the includes BEFORE the body looked correct and are omitted here for brevity. This is just the body part):

So I think they must be returning different html based on my user agent string or something like that. Any ideas of how to trick it?

darindillon commented 11 years ago

Further further info: It's definitely BeautifulSoup that is failing. I'm on a mac (10.8.2 if that matters). Nothing to do with the user-agent string. I added the following lines to your script to see the HTML that we're getting. The HTML is exactly correct:

p = self.browser.open(lurl)
html = p.read()
print html

BUT! If I then pass that to BeautifulSoup, it strips out all the relevant "source" tags, as described in the previous comment:

bb = BeautifulSoup(html)
print bb.prettify()

Any ideas what that might mean?

darindillon commented 11 years ago

OK! I solved it. Coursera is returning incorrect HTML on https://class.coursera.org/compfinance-2012-001/lecture/view?lecture_id=109:

<div class="hidden" id="QL_aria_announcer' aria-live="assertive" aria-relevant="all">

The value of the id attribute starts with a double quote but ends with a single quote. BeautifulSoup4 (at least the version I have from "pip install beautifulsoup4") seems to choke on that and ignores everything below it. But if I read the HTML first and then fix those quotes before parsing:

html = p.read()
html = html.replace("id=\"QL_aria_announcer'", "id='QL_aria_announcer'")
bb = BeautifulSoup(html)

Then everything works fine.
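For anyone who wants to try the workaround outside the script, here is a minimal, self-contained sketch of the same string replacement. The function name repair_announcer_quotes and the sample string are only illustrative (they are not part of coursera-dl), and the replacement targets this one malformed attribute rather than mismatched quotes in general.

```python
def repair_announcer_quotes(html):
    """Rewrite the one malformed attribute (id="QL_aria_announcer')
    so that its opening and closing quotes match.

    Hypothetical helper for illustration; it performs the same
    replacement as the workaround described in this thread.
    """
    broken = "id=\"QL_aria_announcer'"
    fixed = "id='QL_aria_announcer'"
    return html.replace(broken, fixed)


# The malformed tag as reported for the lecture page.
sample = ('<div class="hidden" id="QL_aria_announcer\' '
          'aria-live="assertive" aria-relevant="all">')

repaired = repair_announcer_quotes(sample)
print(repaired)
```

The repaired string can then be handed to BeautifulSoup as before. Pages that do not contain the malformed attribute pass through unchanged.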

dgorissen commented 11 years ago

Cool, thanks for getting to the bottom of this. I can confirm I see the same error in the HTML; however, it's odd, as it works here. Can you list your versions of beautifulsoup4 and mechanize? You can get these by typing "pip freeze | grep -pkg-name-".

darindillon commented 11 years ago

I have BeautifulSoup = 3.2.1, and also BeautifulSoup4 = 4.1.3. Mechanize = 0.2.5. (But the problem isn't the 3.2.1 version: I only installed that AFTER I noticed the script was failing with BeautifulSoup4, so the script is using the 4.x version.) My full listing is below. According to the BeautifulSoup page, "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib", which implies you can somehow configure it to switch between those parsers. Since it works for you but not for me, my guess is that yours is configured to use a different parser than mine, and your parser apparently handles this incorrect HTML better than mine does. But I don't know how to check which parser it's using. Is it possible for the script to force BeautifulSoup to use the good parser instead of the bad one?

BigMac:coursera darin$ pip freeze
BeautifulSoup==3.2.1 ------NOTE: Also see BeautifulSoup4 below
PyRSS2Gen==1.0.0
Twisted==12.0.0
altgraph==0.9
argparse==1.2.1
bdist-mpkg==0.4.4
beautifulsoup4==4.1.3
bonjour-py==0.3
coursera-dl==1.1.9
macholib==1.4.2
mechanize==0.2.5
modulegraph==0.9.1
numpy==1.6.1
py2app==0.6.3
pyOpenSSL==0.13
pyobjc-core==2.3.2a0
pyobjc-framework-AddressBook==2.3.2a0
pyobjc-framework-AppleScriptKit==2.3.2a0
pyobjc-framework-AppleScriptObjC==2.3.2a0
pyobjc-framework-Automator==2.3.2a0
pyobjc-framework-CFNetwork==2.3.2a0
pyobjc-framework-CalendarStore==2.3.2a0
pyobjc-framework-Cocoa==2.3.2a0
pyobjc-framework-Collaboration==2.3.2a0
pyobjc-framework-CoreData==2.3.2a0
pyobjc-framework-CoreLocation==2.3.2a0
pyobjc-framework-CoreText==2.3.2a0
pyobjc-framework-DictionaryServices==2.3.2a0
pyobjc-framework-ExceptionHandling==2.3.2a0
pyobjc-framework-FSEvents==2.3.2a0
pyobjc-framework-InputMethodKit==2.3.2a0
pyobjc-framework-InstallerPlugins==2.3.2a0
pyobjc-framework-InstantMessage==2.3.2a0
pyobjc-framework-InterfaceBuilderKit==2.3.2a0
pyobjc-framework-LatentSemanticMapping==2.3.2a0
pyobjc-framework-LaunchServices==2.3.2a0
pyobjc-framework-Message==2.3.2a0
pyobjc-framework-OpenDirectory==2.3.2a0
pyobjc-framework-PreferencePanes==2.3.2a0
pyobjc-framework-PubSub==2.3.2a0
pyobjc-framework-QTKit==2.3.2a0
pyobjc-framework-Quartz==2.3.2a0
pyobjc-framework-ScreenSaver==2.3.2a0
pyobjc-framework-ScriptingBridge==2.3.2a0
pyobjc-framework-SearchKit==2.3.2a0
pyobjc-framework-ServerNotification==2.3.2a0
pyobjc-framework-ServiceManagement==2.3.2a0

dgorissen commented 11 years ago

Ok, I have been taking a closer look at this. It looks like the problem is indeed with the parser. From the bs4 docs:

If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that     
you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

You are using Python 2.7.2, so it's falling back to the poor built-in parser. You can double-check with:

from bs4 import BeautifulSoup
BeautifulSoup().builder

For me the output is

<bs4.builder._lxml.LXMLTreeBuilder object at 0x24c69d0>

Thus it's using lxml. As suggested in the bs4 docs, I have now made the parser explicit to avoid such problems in the future.

I have committed lxml as default, with a fallback if not available. I have also added a cmdline option to explicitly set the parser. Can you re-run from git and see if that all works for you?
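For readers following along, the idea of "lxml by default, with a fallback" can be sketched roughly as below. This is not the code that was committed: the function name pick_parser and the preference order are assumptions made for illustration. The strings "lxml", "html5lib" and "html.parser" are the parser names that BeautifulSoup accepts as its second argument.

```python
import importlib


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the name of the first importable parser library.

    Hypothetical helper: falls back to Python's built-in
    "html.parser" when none of the preferred libraries is installed.
    """
    for name in preferred:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
        return name
    return "html.parser"


parser_name = pick_parser()
print("Would parse with: " + parser_name)

# Assumed usage, with beautifulsoup4 installed:
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(html, parser_name)
```

Naming the parser explicitly means the result no longer depends on which libraries happen to be installed on a given machine, which is what made the same page parse differently for two users in this thread.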

dgorissen commented 11 years ago

Assuming the issue is fixed; let me know if not.