betterman08 / pydelicious

Automatically exported from code.google.com/p/pydelicious

Unicode #17

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
How does pydelicious handle unicode?  The characters stored/retrieved are
different!

In [114]: orig=u'LaTeX project: LaTeX \x96 A document preparation system' 

In [117]: pydelicious.add('...','...','http://www.latex-project.org/',orig,tags='LaTeX',replace='yes')
Out[117]: {'result': (True, 'done')}

In [119]: xs=pydelicious.get('...','...','LaTeX')['posts']

In [123]: orig==xs[0]['description']
Out[123]: False

In [125]: back=xs[0]['description']

In [126]: back
Out[126]: u'LaTeX project: LaTeX \u2013 A document preparation system'

In [127]: orig.encode('utf-8')
Out[127]: 'LaTeX project: LaTeX \xc2\x96 A document preparation system'

In [128]: back.encode('utf-8')
Out[128]: 'LaTeX project: LaTeX \xe2\x80\x93 A document preparation system'

In [129]: Out[127]==Out[128]
Out[129]: False

Original issue reported on code.google.com by yanghate...@gmail.com on 30 Apr 2008 at 5:37

GoogleCodeExporter commented 9 years ago
It seems to me that u'\x96' is just not a correct Python Unicode string. Could
you have a non-Unicode character in a Unicode string?

Original comment by matej.c...@gmail.com on 8 May 2008 at 10:57
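
For context, u'\x96' is a valid Python Unicode string: it holds U+0096, a C1
control character. One plausible reading, assuming the title was originally
Windows-1252 text that was decoded as ISO-8859-1, is that the byte 0x96 was
meant to be an en dash (U+2013), which is what came back from the server.
Neither assumption is confirmed in this thread. A minimal Python 3 sketch of
the byte-level difference, with shortened strings:

```python
# Python 3 sketch; the Windows-1252 origin of the title is an assumption.
orig = 'LaTeX \x96 A'      # U+0096, a C1 control character
back = 'LaTeX \u2013 A'    # U+2013, EN DASH

# Byte 0x96 is EN DASH in Windows-1252 but maps to U+0096 in ISO-8859-1.
assert b'\x96'.decode('cp1252') == '\u2013'
assert b'\x96'.decode('latin-1') == '\x96'

# These UTF-8 byte sequences match Out[127] and Out[128] in the report.
orig_utf8 = orig.encode('utf-8')   # contains b'\xc2\x96'
back_utf8 = back.encode('utf-8')   # contains b'\xe2\x80\x93'
```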

GoogleCodeExporter commented 9 years ago
Sorry, I don't fully understand this question. What do you mean by "not
correct"? Do you have a pointer to something where I can learn more about
these non-Unicode characters (in the context of Unicode strings)? We are also
interested in getting to the bottom of this for the gbookmark2delicious
project. Thanks.

Original comment by yaa...@gmail.com on 8 May 2008 at 5:06

GoogleCodeExporter commented 9 years ago
After a ton of experimentation, I think I've got it all figured out: one must
use the 'utf-8' codec instead of the 'iso-8859-1' codec. I advise changing the
default codec in DeliciousAPI's constructor.

E.g., if you try to posts_add something with the string '\xf6', then delicious
misinterprets that and stores the wrong character (if you query it, it gives
you u'\u2298'). If, on the other hand, you send it the utf-8-encoded string
'\xc3\xb6', you'll get back the same string.

Original comment by yaa...@gmail.com on 13 May 2008 at 6:12
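
The two byte strings in the comment above are the same character, ö (U+00F6),
under the two codecs. A small Python 3 sketch of the difference follows. That
the server decodes its input as UTF-8 is an assumption here, not something the
thread confirms:

```python
# Python 3 sketch: one character under the two codecs discussed above.
text = '\u00f6'  # LATIN SMALL LETTER O WITH DIAERESIS

latin1_bytes = text.encode('iso-8859-1')  # b'\xf6'
utf8_bytes = text.encode('utf-8')         # b'\xc3\xb6'

# A strict UTF-8 decoder rejects the lone Latin-1 byte.
try:
    latin1_bytes.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# The UTF-8 bytes round-trip to the original character.
round_trip = utf8_bytes.decode('utf-8')
```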

GoogleCodeExporter commented 9 years ago
Hmm, I *think* the 'encode' is only relevant when someone passes in unicode
strings instead of plain strings to the DeliciousAPI methods.

yaaang: what is your locale encoding?

But the handling in _call_server is not correct. I think the following would be
the right way to ensure we post plain (byte) strings to del.icio.us:

    if isinstance(params[key], unicode):
        params[key] = params[key].encode(self.codec)

The thing I am left wondering about is how the server interprets these bytes.
Neither the XML nor the HTTP headers indicate an encoding, so presumably XML's
default applies: utf-8. The elementtree XML parsing always seems to return
unicode strings for these...

I work in a UTF-8 environment, but what about people using latin-1/ISO-8859-1
encoded strings in their bookmarks?

With the above code, any unicode strings I pass to the instance get handled
correctly:

In [231]: da = pydelicious.DeliciousAPI('mpe', passwd, codec='utf-8')

In [232]: da.posts_add('cid:codec-testing-1@del.icio.us', unicode('★', 'utf-8'), replace=True)
Out[232]: {'result': (True, 'done')}

In [233]: da.posts_add('cid:codec-testing-2@del.icio.us', '★', replace=True)
Out[233]: {'result': (True, 'done')}

In [234]: for u in 'cid:codec-testing-1@del.icio.us', 'cid:codec-testing-2@del.icio.us': da.posts_get(url=u)
   .....:
Out[234]:
{'dt': '2008-06-02',
 'posts': [{'description': u'\u2605',
            'hash': '15a97870f0707fb9d33496391eac572f',
            'href': 'cid:codec-testing-1@del.icio.us',
            'others': '',
            'shared': 'no',
            'tag': 'system:unfiled',
            'time': '2008-06-02T15:56:12Z'}],
 'tag': '',
 'user': 'mpe'}
Out[234]:
{'dt': '2008-06-02',
 'posts': [{'description': u'\u2605',
            'hash': '5caff95c3d3ea03a7598f300419a3848',
            'href': 'cid:codec-testing-2@del.icio.us',
            'others': '',
            'shared': 'no',
            'tag': 'system:unfiled',
            'time': '2008-06-02T15:56:25Z'}],
 'tag': '',
 'user': 'mpe'}

So both have the same result and delicious either uses or recognizes UTF-8.

Original comment by berend.v...@gmail.com on 2 Jun 2008 at 3:58
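
The isinstance check proposed for _call_server can be sketched as a standalone
function. This is a Python 3 illustration, where `str` plays the role of Python
2's `unicode`. The name `encode_params` and the sample parameters are
hypothetical and not part of pydelicious:

```python
from urllib.parse import urlencode


def encode_params(params, codec='utf-8'):
    """Hypothetical helper: copy params, encoding text values to bytes."""
    encoded = {}
    for key, value in params.items():
        if isinstance(value, str):  # Python 3 str corresponds to Python 2 unicode
            value = value.encode(codec)
        encoded[key] = value  # bytes and other values pass through unchanged
    return encoded


# Sample parameters, made up for illustration.
params = {'url': 'cid:codec-testing-1@del.icio.us', 'description': '\u2605'}
encoded = encode_params(params)

# urlencode percent-escapes the already-encoded bytes as they are.
query = urlencode(encoded)
```

Encoding once, at the point where the request is built, means callers can pass
either text or pre-encoded byte strings, which is the behaviour the two
posts_add calls above rely on.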

GoogleCodeExporter commented 9 years ago
Err, which is:
- '★' # plain string: '\xe2\x98\x85'
- unicode('★', 'utf-8') # unicode string: u'\u2605'

Original comment by berend.v...@gmail.com on 2 Jun 2008 at 4:01

GoogleCodeExporter commented 9 years ago
OK. The encoding issues should now be resolved and committed.

BTW, see tests/test_encodings.py to see encoding/decoding of utf8 and latin1 in
action.

Original comment by berend.v...@gmail.com on 28 Nov 2008 at 3:57
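
For readers without the repository at hand, a round-trip check for the two
codecs might look roughly like the Python 3 sketch below. It is an
illustration only, not the contents of tests/test_encodings.py, and
`round_trips` is a hypothetical name:

```python
def round_trips(text, codec):
    """Return True if text survives encode/decode with codec.

    Returns False when the codec cannot represent the text at all.
    """
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeEncodeError:
        return False


# ö exists in both codecs; ★ (U+2605) has no Latin-1 representation.
checks = {
    ('\u00f6', 'utf-8'): round_trips('\u00f6', 'utf-8'),
    ('\u00f6', 'iso-8859-1'): round_trips('\u00f6', 'iso-8859-1'),
    ('\u2605', 'utf-8'): round_trips('\u2605', 'utf-8'),
    ('\u2605', 'iso-8859-1'): round_trips('\u2605', 'iso-8859-1'),
}
```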