coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

Error with encoding? #18

Closed feilong closed 10 years ago

feilong commented 11 years ago

I got an encoding error before downloading starts. The course link is https://www.edx.org/courses/MITx/6.00x/2013_Spring/ and the error message is as follows:

You can access 1 courses on edX
1 - 6.00x Introduction to Computer Science and Programming -> Started
Enter Course Number: 1
Traceback (most recent call last):
  File "edx-dl.py", line 146, in <module>
    soup = BeautifulSoup(courseware)
  File "/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.1.3-py2.7.egg/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.1.3-py2.7.egg/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.1.3-py2.7.egg/bs4/builder/_lxml.py", line 195, in feed
    self.parser.close()
  File "parser.pxi", line 1187, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:88786)
  File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:98085)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97909)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:9071)
  File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src/lxml/lxml.etree.c:94081)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 2: invalid continuation byte
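For readers unfamiliar with this message: in UTF-8, the byte 0xe8 announces a three-byte character, so the bytes after it must be continuation bytes (0x80 to 0xbf). The sketch below reproduces the same kind of failure under Python 3, where the codec is reported as 'utf-8' rather than 'utf8'. It uses made-up bytes, not the actual edX page.

```python
# Made-up sample: 0xe8 starts a three-byte UTF-8 sequence, but the next
# byte (0x63, "c") is plain ASCII rather than a continuation byte.
sample = b"ab\xe8cd"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # Report where decoding stopped and why.
    print(exc.start, hex(sample[exc.start]), exc.reason)

# The same byte is a complete character in a single-byte encoding
# such as Latin-1, so decoding with that codec does not fail.
as_latin1 = sample.decode("latin-1")
print(as_latin1)
```

So the message alone does not say which encoding the bytes are really in, only that they are not valid UTF-8 at that position.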

shk3 commented 11 years ago

Could you provide more information about the issue? I cannot reproduce it.

Are the errors shown before you choose the weeks?

You can access 3 courses on edX
1 - CS188.1x Artificial Intelligence -> Started
2 - CS191x Quantum Mechanics and Quantum Computation -> Started
3 - 6.00x Introduction to Computer Science and Programming -> Started
Enter Course Number: 3
6.00x Introduction to Computer Science and Programming has 12 weeks so far
1 - Download Overview videos
2 - Download Week 1 videos
3 - Download Week 2 videos
4 - Download Week 3 videos
5 - Download Week 4 videos
6 - Download Week 5 videos
7 - Download Midterm Exam 1 videos
8 - Download Week 6 videos
9 - Download Week 7 videos
10 - Download Week 8 videos
11 - Download Week 9 videos
12 - Download Peer Grading Panel videos
13 - Download them all

feilong commented 11 years ago

Sure. Please feel free to contact me if there's anything I can do to help.

This error appears just after I select the course.

Here is some additional information that might be useful:

bs4.__version__
'4.1.3'

youtube_dl.__version__
'2013.04.28'

shk3 commented 11 years ago

Are you using python2 or python3? And, what language is your operating system using?

feilong commented 11 years ago

I first used Python 2, and now I've tried Python 3, too. A similar error occurs.

It seems related to beautifulsoup4, and I'm trying to figure out why.

shk3 commented 11 years ago

Thanks. Please let me know if you figure it out.

rbrito commented 11 years ago

Hi, @feilong.

Depending on how you installed BeautifulSoup 4, it can use a number of parsers:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

In the original case, @feilong's system is using lxml, which is probably the fastest, but the fact that lxml encountered an invalid byte in the homepage is, indeed, a problem.

Perhaps the problem happens when lxml tries to parse your name? Given that your name sounds Chinese, I suspect the page may contain Chinese characters.

In that case, you can try another parser. Wherever our code calls BeautifulSoup(foo), try forcing a different parser by passing its name as the second argument, as described in the documentation linked above.
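As an illustration of the second-argument approach, here is a minimal sketch. `make_soup` is a hypothetical helper, not a function from edx-dl, and the markup is made up. It uses "html.parser", the parser built on Python's standard library, which is one of the names listed in the BeautifulSoup documentation. "html5lib" or "lxml" can be passed the same way if they are installed.

```python
from bs4 import BeautifulSoup


def make_soup(markup, parser="html.parser"):
    """Parse markup with an explicitly named parser.

    Naming the parser avoids depending on whichever one BeautifulSoup
    would pick by default on a given machine.
    """
    return BeautifulSoup(markup, parser)


# Made-up markup with a non-ASCII character, not the real edX page.
soup = make_soup("<p>caf\u00e9</p>")
print(soup.p.get_text())
```

Naming the parser also makes the behaviour the same on every machine, since the default depends on which parsers happen to be installed.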

Please report back the results.

shk3 commented 11 years ago

@rbrito, I think @feilong must be using a Chinese-language system, as his last reply contains Chinese characters.

I am confused: Chinese is covered by UTF-8, so why does lxml say the 'utf8' codec cannot decode the byte? Do you mean lxml is trying to parse @feilong's username on edX?
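One possible explanation, offered as an assumption rather than a diagnosis of this page: UTF-8 can represent Chinese text, but only if the bytes were produced by encoding the text as UTF-8. The same characters stored in a legacy encoding such as GBK are different bytes, and a strict UTF-8 decoder rejects them. A small Python 3 sketch with a made-up string:

```python
text = "中文"  # made-up sample, not taken from the edX page

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Bytes that were encoded as UTF-8 decode cleanly as UTF-8.
round_trip = utf8_bytes.decode("utf-8")

# The same text encoded as GBK is not valid UTF-8, so decoding fails
# even though the characters themselves exist in Unicode.
try:
    gbk_bytes.decode("utf-8")
    gbk_as_utf8_failed = False
except UnicodeDecodeError as exc:
    gbk_as_utf8_failed = True
    print("cannot decode byte", hex(gbk_bytes[exc.start]), "at position", exc.start)
```

Nothing in this thread confirms that the courseware page actually contains GBK bytes; the sketch only shows that "Chinese is in UTF-8" and "these bytes are valid UTF-8" are separate questions.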

feilong commented 11 years ago

@rbrito , @shk3 , thank you both for your help!

I've tried using html5lib instead of lxml and it works well, so I believe something went wrong while parsing the HTML with lxml. I'm not sure whether it's related to Chinese characters; my edX username should only contain English characters.

Earlier today I tried saving the contents of the courseware variable to a file so I could test repeatedly without connecting to edX. Here is the short code I used to test.

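#!/usr/bin/env python
from bs4 import BeautifulSoup

# Read raw bytes so that bs4, not open(), does the decoding.
with open('courseware.txt', 'rb') as f:
    cw = f.read()

# from_encoding is the bs4 spelling of the older fromEncoding argument.
soup = BeautifulSoup(cw, from_encoding='UTF-8')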

Interestingly, when I run it, it raises UnicodeDecodeError with different messages, such as UnicodeDecodeError: 'utf8' codec can't decode byte 0xbc in position 0: invalid start byte or UnicodeDecodeError: 'utf8' codec can't decode byte 0xb7 in position 1: invalid start byte. In some trials it runs without any error. I'm really puzzled: since I'm using the same file and the same code, why would the results differ from run to run?
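One way to narrow this down without involving BeautifulSoup or lxml is to read the saved file as raw bytes and list every offset where strict UTF-8 decoding fails. The sketch below is only a suggestion; `find_invalid_utf8` is a hypothetical helper, and the sample bytes are made up from the byte values quoted in the error messages.

```python
def find_invalid_utf8(data):
    """Return (offset, byte_value) for each place strict UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing else is invalid.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            bad = pos + exc.start  # offset within the whole input
            problems.append((bad, data[bad]))
            pos = bad + 1  # skip the offending byte and keep scanning
    return problems


# Made-up bytes for illustration, not the contents of courseware.txt.
for offset, value in find_invalid_utf8(b"ok\xbc\xb7ok"):
    print(offset, hex(value))
```

With the saved file, the input would be `open('courseware.txt', 'rb').read()`. The result depends only on the bytes in the file, so it is the same on every run. If the file turns out to be valid UTF-8, or the offsets never match the varying errors, that would point at the parser rather than at the file.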