coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

UnicodeDecodeError #126

Closed mgorotiza closed 9 years ago

mgorotiza commented 10 years ago

I was able to download the first three weeks of Harvard's CS50 successfully and then got this error:

[download] Destination: Downloaded/CS50x Introduction to Computer Science/88-Vigen
Traceback (most recent call last):
  File "edx-dl.py", line 445, in <module>
    main()
  File "edx-dl.py", line 413, in main
    print(tmp, end="")
  File "edx-dl.py", line 103, in print
    texts.append(original_text.encode(enc, errors='replace').decode(enc))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Manuel-Gorotizas-MacBook-Pro:edx-downloader-master Manny$

A previous edit helped me fix the last error I had, but I don't see anything on this one.

Any suggestions?

Thank you for your help,

Manny

GuiAlmeidaPC commented 10 years ago

This happened to me too; the problem is a special character. To work around it, I just commented out the print function on lines 83 to 104.
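
For context, 0xc3 is the lead byte of many two-byte UTF-8 sequences, including "è" (0xc3 0xa8). That would fit a title such as "Vigenère" being cut off at "88-Vigen" in the output above, but this reading is an assumption and the log does not confirm it. A minimal sketch, in Python 3 syntax, that reproduces the same error message:

```python
# UTF-8 encodes "è" (U+00E8) as two bytes; the first one is 0xc3.
encoded = "\u00e8".encode("utf-8")
assert encoded == b"\xc3\xa8"

message = None
try:
    # Decoding those bytes as ASCII fails on the very first byte.
    encoded.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

Any byte above 0x7f triggers the same failure, so accented Latin letters and CJK characters are affected alike.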

mgorotiza commented 10 years ago

Could you show me how that would look? I'm really new at this. Would it be like "// print"?

I really appreciate your help, thanks.

GuiAlmeidaPC commented 10 years ago

Yes, but in Python, single-line comments start with "#". It looks like the following:

# To replace the print function, the following function must be placed before any other call for print
#def print(*objects, **kwargs):
    #"""
    #Overload the print function to adapt for the encoding bug in Windows Console.
    #It will try to convert text to the console encoding before print to prevent crashes.
    #"""
    #try:
        #stream = kwargs.get('file', None)
        #if stream is None:
            #stream = sys.stdout
        #enc = stream.encoding
        #if enc is None:
            #enc = sys.getdefaultencoding()
    #except AttributeError:
        #return __builtins__.print(*objects, **kwargs)
    #texts = []
    #for object in objects:
        #try:
            #original_text = str(object)
        #except UnicodeEncodeError:
            #original_text = unicode(object)
        #texts.append(original_text.encode(enc, errors='replace').decode(enc))
    #return __builtins__.print(*texts, **kwargs)

Note that I'm using Linux, so I'm not sure if this works in Windows. If you still encounter problems, I'll be glad to help.
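
Commenting out the override avoids the crash but also drops the console-encoding workaround it was written for. The traceback points at the `original_text.encode(enc, errors='replace')` line; on Python 2, calling `.encode()` on a byte string first decodes it implicitly as ASCII, which is a plausible source of the error. A minimal sketch of an alternative, written for Python 3 and using a hypothetical helper name `to_console_text` (not part of edx-dl), decodes bytes explicitly before re-encoding. Treating byte strings as UTF-8 is an assumption.

```python
import sys


def to_console_text(obj, enc=None):
    """Return obj as text that the given encoding can represent.

    Hypothetical helper, not part of edx-dl.
    """
    if enc is None:
        enc = getattr(sys.stdout, "encoding", None) or sys.getdefaultencoding()
    if isinstance(obj, bytes):
        # Assumption: byte strings carry UTF-8; undecodable bytes are replaced.
        text = obj.decode("utf-8", errors="replace")
    else:
        text = str(obj)
    # Round-trip through the console encoding, replacing unsupported characters.
    return text.encode(enc, errors="replace").decode(enc)


print(to_console_text(b"88-Vigen\xc3\xa8re", enc="ascii"))
```

With an ASCII console the accented letter is replaced by "?" instead of raising an exception, so the download can continue.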

florianbuetow commented 9 years ago

In case someone wants to recreate it for debugging purposes: This error also occurs when trying to download the "edX Demonstration Course".

iemejia commented 9 years ago

@fbcom I tried with this course, https://courses.edx.org/courses/edX/DemoX.1/2014/info, and I didn't have any issues. Is this the one you are talking about?

phonx commented 9 years ago

It seems that courses with Chinese characters can't be downloaded, e.g.: https://courses.edx.org/courses/course-v1:PekingX+20000001x+2015T1/info

florianbuetow commented 9 years ago

@iemejia I've tried it again and now everything seems to work fine.

iemejia commented 9 years ago

This commit seems to solve this issue. If you still have problems, please report them here with a clear example (the course and week where you find the problem, as well as your Python version).

KwToPA commented 8 years ago

OSX 10.10.5 python 2.7.6

proxychains4 python edx-dl.py -u 1354138253@qq.com -o /document https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/info
[proxychains] config file found: /usr/local/Cellar/proxychains-ng/4.7/etc/proxychains.conf
[proxychains] preloading /usr/local/Cellar/proxychains-ng/4.7/lib/libproxychains4.dylib
[proxychains] DLL init
[proxychains] DLL init
Password: 
Building initial headers for future requests.
Getting initial CSRF token.
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
Extracting course information from dashboard.
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
Downloading English Grammar and Essay Writing [BerkeleyX/ColWri.2.2x/1T2015]
Downloading 5 section(s)
Section  1: Week 1: Vocabulary Development
  Welcome to College Writing 2.2x! 
  Methods for Vocabulary Development 
  Vocabulary in Context 
  Week 1 Additional Homework 
Section  2: Week 2: Understanding Tone and Diction
  Appropriate Diction 
  Understanding Connotations 
  Choosing Your Tone in Writing 
  Homework: Using tone 
  Week 2 Additional Homework 
Section  3: Week 3:  Common Errors in Writing
  Wordiness 
  Misspellings 
  Grammar 
  Week 3 Additional Homework 
Section  4: Week 4: Advanced Process Writing
  Pre-Writing Practices 
  Three Brainstorming Techniques 
  Your Essay: Draft 
  Week 4 Quiz 
  None
Section  5: Week 5 : Advanced Revision, Proofreading, and Editing
  Writing and Revision 
  Editing Techniques 
  Final Essay: Submission 
  Week 5 Additional Homework 
Extracting all units information in parallel.
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/42e28dbf0b81488887be0f92a44484c9/19a7ac548119487181e1f466cf48444c/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/42e28dbf0b81488887be0f92a44484c9/4fe41105a81b4c21b5ed860c42a70212/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/42e28dbf0b81488887be0f92a44484c9/37979c9a7d784b4db16f654aebfa2801/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/42e28dbf0b81488887be0f92a44484c9/40b3be1cda554466a9ed5930665bfa53/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f714e4736d444d819320d6d38a474e8d/60d546032042414bb976a311ab93967f/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f714e4736d444d819320d6d38a474e8d/8b3df860dcf6412f86263d694a4577f7/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f714e4736d444d819320d6d38a474e8d/a31d720f318f4d71bad07111e45a3609/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f714e4736d444d819320d6d38a474e8d/c7b9062b0d0d4387984ae34eae47b2ab/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f714e4736d444d819320d6d38a474e8d/233d7b7adc1747b696d217b4ceda8db1/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/e30211a9f78d435cb700935a1a1abb77/f19dd621b87745a49fd12caac1acfabf/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/e30211a9f78d435cb700935a1a1abb77/e268b5d391454d55a23c774031386534/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/e30211a9f78d435cb700935a1a1abb77/95712a23668d499599a7d8e7b808a899/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/e30211a9f78d435cb700935a1a1abb77/a9420b72f7ff4a6aa5c29c6cfe2f5bf3/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f82841c03e5f4f4fb9399c17aa7a837a/8f9fddffd57d4a07a6ee6badcb792691/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f82841c03e5f4f4fb9399c17aa7a837a/3b03caefacf6404b97ced8b870d951cf/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f82841c03e5f4f4fb9399c17aa7a837a/e2f001e2413744989a82418d51163224/'
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
 ...  OK
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f82841c03e5f4f4fb9399c17aa7a837a/46531c27bf304d30b02a45628719157d/'
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/f82841c03e5f4f4fb9399c17aa7a837a/a0a1e250a28f45ac999fac034befb7bf/'
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/536f0f7655654c5892a596c13668a2be/3ccadcf34bf641b18e4dded7c8ab9143/'
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 [proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
 ...  OK
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/536f0f7655654c5892a596c13668a2be/6533f71b4d1f427390f3fb8efb306e6a/'
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/536f0f7655654c5892a596c13668a2be/33165f3c53854c0f8563622e8b87da2f/'
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443 Processing 'https://courses.edx.org/courses/BerkeleyX/ColWri.2.2x/1T2015/courseware/536f0f7655654c5892a596c13668a2be/26cb9e32c78b4e9c998ff8d9a2fbc44b/'
[proxychains] Strict chain  ...  127.0.0.1:1080  ...  courses.edx.org:443  ...  OK
 ...  OK
Removed 0 duplicated urls from 39 in total
Output directory: /document
Traceback (most recent call last):
  File "edx-dl.py", line 6, in <module>
    edx_dl.main()
  File "/Users/document/edx-dl/edx-dl/edx_dl/edx_dl.py", line 1038, in main
    download(args, selections, all_units, headers)
  File "/Users/document/edx-dl/edx-dl/edx_dl/edx_dl.py", line 815, in download
    clean_filename(section_dirname))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 80, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 27: ordinal not in range(128)

KwToPA commented 8 years ago

I think it is because I use Chinese characters in the output path, like /文件:

-o /document/文件

If I use the following path, it works:

-o /document

Has anybody solved this bug?
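
The traceback ends in `posixpath.join` at `path += '/' + b`. On Python 2, this kind of error typically appears when a unicode string is concatenated with a byte string that contains non-ASCII bytes, because the byte string is decoded implicitly as ASCII. One possible workaround, sketched below in Python 3 syntax with a hypothetical helper `join_as_text` (not part of edx-dl), is to decode every path component to text before joining. The UTF-8 default is an assumption about how the path was encoded.

```python
import posixpath


def join_as_text(*parts, encoding="utf-8"):
    """Join path components after converting each one to text.

    Hypothetical helper, not part of edx-dl.
    """
    text_parts = []
    for part in parts:
        if isinstance(part, bytes):
            # Assumption: byte paths are encoded with `encoding`.
            part = part.decode(encoding)
        text_parts.append(part)
    return posixpath.join(*text_parts)


# A byte-string output directory mixed with a text section name.
output_dir = "/document/文件".encode("utf-8")
print(join_as_text(output_dir, "Week 1"))
```

Keeping all path components as text from the moment the command-line arguments are parsed avoids the mixed-type concatenation altogether.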