andrewxhill / MOL

The Map of Life
mol.colorado.edu/

Unicode in config.yaml #104

Closed (tucotuco closed 12 years ago)

tucotuco commented 13 years ago

DictWriter's writerow() fails for any line in config.yaml containing Unicode, giving the error "UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 110: ordinal not in range(128)". Suspicion is that PyYaml's Config class is not Unicode-ready, so reading in Unicode won't work even if we write out Unicode using the code provided in unicodewriterhelper.py.
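For anyone reproducing this, here is a minimal sketch of the underlying failure, independent of DictWriter and config.yaml. The sample string is made up; only the character u'\u2122' (the trademark sign) comes from the error above. The exception is the same kind of error quoted in this report, and encoding as UTF-8 instead is the usual way around it.

```python
# Hypothetical sample value; only the u'\u2122' character is from the report.
text = u"Map of Life\u2122"

# The ASCII codec cannot represent U+2122, so this raises UnicodeEncodeError.
try:
    text.encode("ascii")
    message = None
except UnicodeEncodeError as err:
    message = str(err)

# UTF-8 can represent every Unicode character, so this succeeds.
encoded = text.encode("utf-8")
```

The error message names the offending character and its position, which helps locate the line in the input file.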

Bottom line, probably need to read in config.yaml properly with Unicode and change DictWriter to UnicodeWriter.

@gaurav Game for this one?

gaurav commented 13 years ago

I'll look into it!

eightysteele commented 13 years ago

@gaurav: I wrote a UnicodeDictReader and UnicodeDictWriter based on examples in the Python docs. Will something like this help? https://gist.github.com/1174811

tucotuco commented 13 years ago

So, this code is pretty much already in workflow/mol-data/unicodewriterhelper.py. May want to just commit any changes you have there.

On Fri, Aug 26, 2011 at 6:08 PM, eightysteele wrote:

> @gaurav: I wrote a UnicodeDictReader and UnicodeDictWriter based on examples in the Python docs. Will something like this help? https://gist.github.com/1174811


eightysteele commented 13 years ago

> So, this code is pretty much already in workflow/mol-data/unicodewriterhelper.py

Let's see, a couple of things there. UnicodeReader.next() returns lists instead of dictionaries, and UnicodeWriter.writerow() takes a list instead of a dictionary. I think what we want here is DictWriter/DictReader-like behavior?
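To illustrate the DictWriter-like behavior meant here, a rough sketch follows. It is not the code from the gist or from unicodewriterhelper.py: the class body, the field names, and the sample row are all assumptions, and it is written for Python 3, whereas the code in this thread targets Python 2. The idea is the same as the recipe in the Python docs: let the csv module write text into an in-memory buffer, then encode that text as UTF-8 and write it to a binary stream.

```python
import csv
import io


class UnicodeDictWriter:
    """Illustrative sketch: takes dicts, writes UTF-8 encoded CSV bytes."""

    def __init__(self, stream, fieldnames, encoding="utf-8", **kwargs):
        self.stream = stream        # binary file-like object
        self.encoding = encoding
        self.queue = io.StringIO()  # csv writes text here first
        self.writer = csv.DictWriter(self.queue, fieldnames, **kwargs)

    def writeheader(self):
        self.writer.writeheader()
        self._flush()

    def writerow(self, row):
        # row is a dict keyed by field name, as with csv.DictWriter
        self.writer.writerow(row)
        self._flush()

    def _flush(self):
        # Encode the buffered text and hand it to the target stream,
        # then empty the buffer for the next row.
        data = self.queue.getvalue()
        self.stream.write(data.encode(self.encoding))
        self.queue.seek(0)
        self.queue.truncate(0)


# Usage with made-up field names and values.
out = io.BytesIO()
writer = UnicodeDictWriter(out, ["name", "note"])
writer.writeheader()
writer.writerow({"name": u"Map of Life\u2122", "note": "ok"})
result = out.getvalue()
```

In Python 3 the same result can also be had by opening the file in text mode with encoding="utf-8" and using csv.DictWriter directly; the wrapper is only needed when the target stream is binary.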

tucotuco commented 13 years ago

Pushed UnicodeReader/Writer as adapted by Aaron to mol-data (@7e39087), replacing the earlier version that read and wrote lists rather than dictionaries. Needs incorporation in loader.py and testing.

gaurav commented 12 years ago

I tweaked the UnicodeDictWriter and got it to work (as of @e269cc95). The collection.csv.txt is now being correctly written in UTF-8 (as verified by Vim and TextEdit). Unfortunately, this leads to two odd pieces of behavior: JSON fields in the database into which Unicode is being written have the Unicode as '\uDDDD' escapes instead of the actual Unicode character, and attempting to set field values to Unicode strings results in UnicodeEncodeErrors from within the App Engine Launcher.
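One possible source of the '\uDDDD' form, offered as a guess rather than a diagnosis of our loader: Python's json module escapes non-ASCII characters by default (ensure_ascii=True). A small sketch with a made-up record shows the difference. Both outputs decode back to the same value, so the escaped form is valid JSON and not data loss.

```python
import json

# Hypothetical record; the field name and value are made up.
record = {"name": u"Map of Life\u2122"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the character itself is kept in the output.
literal = json.dumps(record, ensure_ascii=False)

# Both forms parse back to the same data.
same = json.loads(escaped) == json.loads(literal) == record
```

If the escapes in the database come from somewhere else in the pipeline, this will not change anything, but it is a cheap thing to rule out.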

It sounds like my next target needs to be the bulkload_helper.py script, since (I think?) that's responsible for getting our data from the CSV files to the App Engine. Which is convenient, since I also need to work on bulkload_helper.py for issue #110. @tucotuco: if you don't need this done in a screaming hurry (i.e. early next week is okay), feel free to reassign this bug to me!

tucotuco commented 12 years ago

Fair enough. You're in there and familiar with it now. It's yours.

gaurav commented 12 years ago

I think I've got this nailed as of @634e12ef8a7. There have been a lot of changes to a lot of files, though (loader.py, metagen.py, config.yaml), so it's likely other bugs will show up as a result. If/when they do, please reopen this bug or file a new one!