IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries, and file contents using the non-public API
MIT License
93 stars 45 forks

UnicodeEncodeError: 'charmap' codec can't encode character; character maps to <undefined> #9

Closed: cloudywings2 closed this issue 5 years ago

cloudywings2 commented 5 years ago

Hello! I get this error when I try to run this program on some groups. Is there a way to bypass this error so the whole program doesn't stop? Thanks!

cloudywings2 commented 5 years ago

I think I was able to fix it by changing line 86 to

name = unescape_html(path['fileName']).encode("utf-8")

and line 114 to

pname = unescape_html(photo['photoName']).encode("utf-8")

It's working for me so far anyway.
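For context, here is a minimal Python 3 sketch of why an explicit UTF-8 encode can sidestep this error. The script in this thread appears to be Python 2, so this only illustrates the mechanism, and the file name below is made up. Encoding to a narrow Windows code page such as cp1252 fails for characters outside that code page, while UTF-8 can represent any character.

```python
# Hypothetical file name containing U+0308 (combining diaeresis),
# a character that the cp1252 code page cannot represent.
name = "photo u\u0308ber.jpg"

try:
    name.encode("cp1252")  # narrow, table-based Windows code page
    charmap_ok = True
except UnicodeEncodeError as exc:
    charmap_ok = False
    print("cp1252 failed:", exc.reason)

# UTF-8 covers every Unicode character, so this call succeeds.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)
```

Note that in Python 3 the encoded result is a bytes object, so it would need decoding again before being used as text.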

kekkc commented 5 years ago

> I think I was able to fix it by changing line 86 to
>
> name = unescape_html(path['fileName']).encode("utf-8")
>
> and line 114 to
>
> pname = unescape_html(photo['photoName']).encode("utf-8")
>
> It's working for me so far anyway.

Nice tip. I guess this would also solve the other HTTP error 400/500 issues currently listed, because when I initially placed encode("utf-8") somewhere else, the script generated HTTP errors and stopped after the first one.

This should definitely be included in the master version, now that Yahoo Groups is shutting down.

IgnoredAmbience commented 5 years ago

Could you give an example of a group that caused this error please, so that I may test a new version?

n4mwd commented 5 years ago

I'm having the same problem. It seems to freak out with non-ASCII characters. It should sanitize the file name by replacing unknown characters with '-' or something equally benign, or maybe use a hex encoding like "%4df1%". The one that got me was a single right quote sign: like a ' but from an unusual charset.
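The two ideas above (replace unknown characters with '-', or hex-encode them) could be sketched roughly as follows in Python 3. Both helper names are hypothetical and are not part of yahoo.py; the set of "safe" characters is an assumption chosen to be conservative.

```python
import re
import unicodedata
from urllib.parse import quote


def sanitize_filename(name, replacement="-"):
    """Return an ASCII-only file name (hypothetical helper, not from yahoo.py)."""
    # Decompose accented characters so the base letter survives.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks left over from decomposition.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace each run of characters outside a conservative safe set.
    cleaned = re.sub(r"[^A-Za-z0-9._ -]+", replacement, stripped)
    # Never return an empty name.
    return cleaned.strip() or replacement


def percent_encode_filename(name):
    """Percent-encode the UTF-8 bytes of unsafe characters (hypothetical helper)."""
    # safe=" " keeps spaces; letters, digits and "_.-~" are always kept by quote().
    return quote(name, safe=" ")


# Example with a right single quotation mark (U+2019) and an accented letter.
print(sanitize_filename("Bob\u2019s caf\u00e9.jpg"))
print(percent_encode_filename("Bob\u2019s.jpg"))
```

The replacement approach is lossy but produces readable names; the percent-encoded form is reversible, at the cost of less readable file names.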

tripleee commented 5 years ago

Here's one recent traceback, unfortunately from a private group.

* Fetching album 'xxx 2018' (1/18)
** Fetching photo 'yyy zzz 61x92 vvv wwww nnnn' (1/4)
Traceback (most recent call last):
  File "./yahoo.py", line 474, in <module>
    archive_photos(yga)
  File "./yahoo.py", line 193, in archive_photos
    print "** Fetching photo '%s' (%d/%d)" % (pname, p, photos['total'])
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0308' in position 21: ordinal not in range(128)

I would assume that this is the same bug, though there are several variations. The tracebacks I have are all related to invalid characters, not so much improper HTML encoding.

This particular one seems to come from the diagnostic print, not actually being unable to create a file with that name.
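If the failure is only in the diagnostic print, one possible workaround is to escape whatever the output stream cannot encode before printing. The following is a Python 3 sketch under that assumption; `safe_text` is a hypothetical helper, not something in yahoo.py, and the photo name is made up.

```python
import sys


def safe_text(text, encoding=None):
    """Return text with unencodable characters backslash-escaped (hypothetical helper)."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns e.g. U+0308 into the literal text \u0308.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


pname = "yyy zzz u\u0308"  # made-up name with a combining diaeresis
print("** Fetching photo '%s' (%d/%d)" % (safe_text(pname), 1, 4))
```

This keeps the progress message readable on an ASCII-only terminal without altering the name used to create the file.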

IgnoredAmbience commented 5 years ago

I suspect this is now fixed on master.