couchdb-dump cannot deal with unicode characters in doc ids

djc commented 10 years ago

From heshim...@gmail.com on May 14, 2011 09:58:59

What steps will reproduce the problem?
1. Create a document in couchdb, with some Chinese character like "文档"
2. Run couchdb-dump on the database

What is the expected output? What do you see instead?
couchdb-dump crashes upon reaching this document. Here are the last lines of the trace:

  File "/pylonsenv/lib/python2.6/site-packages/couchdb/multipart.py", line 122, in __init__
    self._write_headers(headers)
  File "/pylonsenv/lib/python2.6/site-packages/couchdb/multipart.py", line 175, in _write_headers
    self.fileobj.write(headers[name])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

What version of the product are you using? On what operating system?
couchdb-python 0.8 against couchdb 1.0.1 on Ubuntu.

Original issue: http://code.google.com/p/couchdb-python/issues/detail?id=179
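
The error class in the trace can be reproduced in isolation, without CouchDB. This is a minimal sketch, assuming Python 3; the report is on Python 2.6, where writing a `unicode` object to a file performs the same ASCII encode implicitly. The id is the example value from the report, and `encode_ascii` is a hypothetical helper, not part of couchdb-python.

```python
# Sketch only: reproduces the error class from the trace, not couchdb-python itself.
doc_id = "文档"  # example value from the report


def encode_ascii(value):
    """Encode text as ASCII, as Python 2 does implicitly in file.write(unicode)."""
    return value.encode("ascii")


try:
    encode_ascii(doc_id)
    error = None
except UnicodeEncodeError as exc:
    # Same exception type as in the reported trace.
    error = exc

print(type(error).__name__, error)
```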

djc commented 10 years ago

From heshim...@gmail.com on May 14, 2011 01:20:31

I just needed a quick solution to dump the database and reload it in another environment. So I made some changes to multipart.py to get past this utf-8 thing. It did work.

However, I understand that other parts are using multipart.py too. This probably won't fit the MIME standard. If I have time, I'll investigate further and provide a patch that does satisfy the MIME standard.

Attachment: utf-8_dump_load.patch

djc commented 10 years ago

From kxepal on May 14, 2011 01:20:59

Confirmed. There is also an invalid test case for how the multipart module works with unicode data: StringIO can handle mixed "str" and "unicode" values, but files require "str" only.

djc commented 10 years ago

From kxepal on May 14, 2011 02:20:30

Sorry, I was wrong about the tests; StringIO confused me (: Don't rush, sit down and think it over... yes (: There is no need to fix the multipart module, only the dump tool, because it passes a unicode document id to the multipart writer. That is what dump-tool.patch is about.

dump-tool-2.patch solves the same problem, but respects the Content-Type header and its charset. I suppose that would be a more correct solution.

Attachment: dump-tool.patch dump-tool-2.patch

djc commented 10 years ago

From heshim...@gmail.com on May 14, 2011 03:43:38

Ah, that's much smarter. Thanks!

djc commented 10 years ago

From heshim...@gmail.com on May 14, 2011 03:48:06

Hmm... another thing. I was under the impression that utf-8 encoded strings aren't valid ascii. Currently, isn't multipart.py expecting strict ascii strings as header?

djc commented 10 years ago

From kxepal on May 14, 2011 04:14:32

Actually, only the first 128 characters of the utf-8 encoding are valid ascii. The problem was not which characters are in the headers, but the type of string multipart tries to write into the output stream. Files and streams don't expect pure unicode strings; they favor the strings called "bytes" in Python 3 terminology, and the multipart module expects this behavior.

But there was a "hack" that adds the document id to the headers, which the couchdb-load tool uses to create the document with the same id value. Since a document id can be unicode, this "hack" breaks that expectation and makes multipart crash.

You could try reverting the patch and, in dump.py, changing the default value of the output argument of the dump_db function from sys.stdout to StringIO.StringIO. The error would not occur, because StringIO can handle both str and unicode values.
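
The bytes-versus-text distinction in this comment can be illustrated with Python 3's `io` module, where the two kinds of stream are separate types. This is only a sketch under that assumption: `write_header` is a hypothetical helper and the header name is illustrative, neither is taken from couchdb-python or the attached patches.

```python
import io


def write_header(stream, name, value):
    """Hypothetical helper: write one 'Name: value' header line as UTF-8 bytes."""
    line = "%s: %s\r\n" % (name, value)
    stream.write(line.encode("utf-8"))


# ASCII text is byte-for-byte identical in ASCII and in UTF-8 ...
assert "doc1".encode("utf-8") == "doc1".encode("ascii")
# ... while each of these two characters takes three bytes in UTF-8.
assert len("文档".encode("utf-8")) == 6

binary_out = io.BytesIO()  # behaves like a file opened in binary mode
write_header(binary_out, "Content-ID", "文档")

# A byte stream rejects text that has not been encoded first.
try:
    io.BytesIO().write("文档")
    rejected = False
except TypeError:
    rejected = True
```

Encoding at the point where the value is handed to the writer keeps the writer itself byte-only, which matches the observation above that the dump tool, not the multipart module, is where the unicode value enters.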

djc commented 10 years ago

From djc.ochtman on May 14, 2011 05:24:10

IMO the correct way to have non-ASCII strings in MIME headers would be to use RFC 2047 encoding for any non-ascii header values.
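
As a rough illustration of the RFC 2047 approach, the standard library's `email.header` module can produce and parse encoded-words. The sketch below assumes Python 3 and is not the patch that was eventually committed; `encode_header_value` and `decode_header_value` are hypothetical names.

```python
from email.header import Header, decode_header


def encode_header_value(value):
    """Return an ASCII-only header value, using RFC 2047 only when needed."""
    try:
        value.encode("ascii")
        return value  # plain ASCII passes through unchanged
    except UnicodeEncodeError:
        # Long values may be folded across several lines by Header.encode().
        return Header(value, "utf-8").encode()


def decode_header_value(value):
    """Inverse of encode_header_value for a single header value."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


encoded = encode_header_value("文档")
print(encoded)
```

Only non-ASCII values are wrapped, so headers that are already plain ASCII stay readable and unchanged on the wire.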

djc commented 10 years ago

From kxepal on May 14, 2011 05:50:13

Correct, but it looks like overhead in this case, because it would apply to only one header while the others should follow RFC 822. Wouldn't it be better to use base64 encoding?
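
For comparison, plain base64 and the RFC 2047 "B" encoding carry the same payload; the encoded-word form only adds charset and encoding markers around it. A small sketch, assuming Python 3, with the example id from the original report:

```python
import base64

doc_id = "文档"  # example value from the report

# Plain base64 of the UTF-8 bytes.
payload = base64.b64encode(doc_id.encode("utf-8")).decode("ascii")

# RFC 2047 encoded-word layout: =?charset?encoding?encoded-text?=
encoded_word = "=?utf-8?b?%s?=" % payload

# Plain base64 needs a separate agreement that the header is encoded;
# the encoded-word form is self-describing.
decoded = base64.b64decode(payload).decode("utf-8")
```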

djc commented 10 years ago

From heshim...@gmail.com on June 01, 2011 23:47:16

Hmm... I'd like to make a note here that kxepal's dump-tool-2.patch actually generated some invalid multipart boundaries.

djc commented 10 years ago

From djc.ochtman on September 21, 2012 01:32:57

Owner: kxepal

djc commented 10 years ago

From wickedg...@gmail.com on September 21, 2012 17:44:41

Labels: Milestone-0.9

djc commented 10 years ago

From djc.ochtman on October 22, 2012 04:26:42

Any progress on this?

djc commented 10 years ago

From kxepal on October 22, 2012 04:33:00

Yes, I will submit a patch with tests during this week. I agree with you about the RFC 2047 specification, so I'm diving into it.

djc commented 10 years ago

From kxepal on April 24, 2013 10:20:16

Patch attached. Non-ascii headers are now encoded following RFC 2047. Actually, I feel like rewriting the multipart module on top of the email package, but that would probably be another issue: it needs workarounds for some email-specific features to keep backward compatibility.

Status: Accepted

djc commented 10 years ago

From kxepal on April 24, 2013 10:25:40

Sorry, forgot to clean up the testing prints. Reattached.

Attachment: couchdb-python_485.patch

djc commented 10 years ago

From djc.ochtman on April 25, 2013 03:09:42

Pushed a slightly changed patch as rce40fd77ae8d, thanks!

Status: Fixed

djc commented 10 years ago

From djc.ochtman on April 25, 2013 04:16:51

Labels: -Milestone-0.9
