Some weeks are not downloading

dgorissen / coursera-dl

A script for downloading course material (videos, PDFs, quizzes, etc.) from coursera.org

http://dirkgorissen.com/2012/09/07/coursera-dl-a-coursera-download-script/

GNU General Public License v3.0


Some weeks are not downloading #49

Closed aidank closed 11 years ago

aidank commented 11 years ago

I've been using coursera-dl on Ubuntu successfully for several months, but recently I've encountered an issue where only the first weeks of a completed course will download. For example, "A Beginner's Guide to Irrational Behaviour" (behavioralecon-001) has 6 weeks of videos but only the first and some of the second week download. No error message is produced as far as I can make out. I've also encountered the same issue with linearopt-001.

olegafx commented 11 years ago

Works well for me (A Beginner's Guide to Irrational Behaviour)

dgorissen commented 11 years ago

Hmm, I'm on a mobile connection so I can't test now. Which parser are you using, and does using a different parser (e.g., html5lib) fix anything?

altimerk commented 11 years ago

Hi, I got the same issue. When I tried to download the compfinance-003 course, it grabbed only the first two weeks' lectures. I found that the issue is in soup: soup = BeautifulSoup(vidpage, self.parser) gets incomplete HTML content, even though it contains all closing tags such as 'body', 'html', and so on. In other courses, for example progfun-002, soup gets content that contains 'li' tags without any nested div tags, and we get an issue similar to #18.
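One way to check the "incomplete HTML" observation above is to count tags in the raw page with the standard library and compare that against what BeautifulSoup returns for a given parser. This is a minimal sketch, not the script's own code: the helper names and the sample markup are hypothetical, and the sample does not reflect Coursera's real page structure. The BeautifulSoup half assumes the third-party bs4 package is installed and returns None otherwise.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags seen in the raw markup (standard library only)."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags_stdlib(html):
    """Baseline: tag counts taken directly from the downloaded page text."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


def count_tags_bs4(html, parser_name):
    """Tag counts as seen through BeautifulSoup with the named parser.

    Returns None if bs4 or the requested parser is not installed.
    """
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        return None
    try:
        soup = BeautifulSoup(html, parser_name)
    except FeatureNotFound:
        return None
    counts = {}
    for element in soup.find_all(True):  # True matches every tag
        counts[element.name] = counts.get(element.name, 0) + 1
    return counts


# Hypothetical stand-in for a lecture index page: one list item per week.
SAMPLE_PAGE = (
    "<html><body><ul>"
    "<li><div>Week 1</div></li>"
    "<li><div>Week 2</div></li>"
    "<li><div>Week 3</div></li>"
    "</ul></body></html>"
)

if __name__ == "__main__":
    baseline = count_tags_stdlib(SAMPLE_PAGE)
    print("raw page:", baseline.get("li", 0), "li tags")
    for name in ("lxml", "html5lib", "html.parser"):
        seen = count_tags_bs4(SAMPLE_PAGE, name)
        if seen is None:
            print(name, "-> not available")
        else:
            print(name, "->", seen.get("li", 0), "li tags")
```

If a parser reports fewer 'li' or 'div' tags than the raw page contains, the content is being lost at the parsing step rather than at download time.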

dgorissen commented 11 years ago

Can you confirm whether this also happens with a different parser? For example, try both html5lib and lxml.

altimerk commented 11 years ago

Hi, I've got that issue with lxml, because it's the default parser in version 1.4.8. When I changed to html.parser it was OK! When I use html5lib I get this message:

Collecting downloadable content from https://class.coursera.org/progfun-002/lecture/index
Warning: no downloadable content found for progfun-002, did you accept the honour code?

aidank commented 11 years ago

I see the same behaviour on my machine as reported by altimerk above: using the default (lxml), the download stops prematurely with some weeks missing; using html5lib, nothing downloads and a warning is produced ("Warning: no downloadable content found for behavioralecon-001, did you accept the honour code?"); using html.parser appears to solve the problem and everything downloads successfully.

Many thanks for helping to resolve this!

dgorissen commented 11 years ago

Good, that's resolved. In light of all this, it probably makes sense to use html.parser as the default.
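Since the thread shows three parsers giving three different results on the same page (partial, empty, complete), one possible safeguard is to run the extraction with each candidate and keep the parser that yields the most items. The sketch below is only an illustration under that assumption: pick_parser_by_yield and extract_links_bs4 are hypothetical names, not functions from coursera-dl, and the extractor shown assumes the third-party bs4 package is installed.

```python
def pick_parser_by_yield(page_html, parser_names, extract_items):
    """Run extract_items(page_html, name) for each parser name.

    Returns (best_name, counts), where counts maps each parser name to the
    number of items it produced. A parser that raises (for example because
    it is not installed) is recorded as yielding 0 items. Ties go to the
    parser listed first.
    """
    counts = {}
    for name in parser_names:
        try:
            counts[name] = len(extract_items(page_html, name))
        except Exception:
            # Missing parser or parse failure: treat as "found nothing".
            counts[name] = 0
    best = max(parser_names, key=lambda n: counts[n]) if parser_names else None
    return best, counts


def extract_links_bs4(page_html, parser_name):
    """Hypothetical extractor: every href on the page, via BeautifulSoup."""
    from bs4 import BeautifulSoup  # third-party; assumed to be installed

    soup = BeautifulSoup(page_html, parser_name)
    return [a["href"] for a in soup.find_all("a", href=True)]
```

A call such as pick_parser_by_yield(page, ["html.parser", "lxml", "html5lib"], extract_links_bs4) lists html.parser first, so it wins ties, which matches the suggestion to make it the default.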