MapofLife / MOL

Integrating information about species distributions in an effort to support global understanding of the world's biodiversity.
http://mol.org
BSD 3-Clause "New" or "Revised" License

Getting Unicode data into CartoDB #10

Closed by eightysteele 12 years ago

eightysteele commented 12 years ago

We have a unicode problem going from config.yaml to GeoJSON to CartoDB.

Example, in the IUCN config.yaml file:

title: "IUCN Red List™ of Threatened Species"

In the GeoJSON output:

"title": "IUCN Red List\u2122 of Threatened Species"

In CartoDB:

IUCN Red List™ of Threatened Species

gaurav commented 12 years ago

As of @dd88640, GeoJSON is now being correctly written in UTF-8. Unfortunately, CartoDB is still not reading UTF-8 correctly. At first I thought that was an encoding problem at their end, but they do support UTF-8. I'll try to find the smallest bit of code that fails, and see if I can figure out what's going on there.
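To illustrate the difference between the escaped and the literal form, here is a minimal sketch in Python 3. It is not the code from @dd88640; the `feature` dictionary is a made-up stand-in for the real GeoJSON output. Both forms are valid JSON and decode to the same string, so the `\u2122` escape in the GeoJSON is not itself an error.

```python
import json

# Hypothetical feature; only the title comes from the IUCN example above.
feature = {
    "type": "Feature",
    "properties": {"title": "IUCN Red List\u2122 of Threatened Species"},
    "geometry": None,
}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(feature)

# ensure_ascii=False keeps the character itself in the output string.
literal = json.dumps(feature, ensure_ascii=False)

# Encoding that string as UTF-8 gives the bytes that would go into the file.
utf8_bytes = literal.encode("utf-8")

# Both serialisations parse back to the same data.
assert json.loads(escaped) == json.loads(utf8_bytes.decode("utf-8"))
```

If a consumer mangles the literal form but handles the escaped form, the problem is in how it decodes the bytes, not in the JSON.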

gaurav commented 12 years ago

I've confirmed that a small UTF-8 encoded GeoJSON file (https://gist.github.com/1374824) cannot be uploaded to CartoDB without messing up the text, and reported this to Vizzuality as http://support.cartodb.com/discussions/problems/66-geojson-import-doesnt-get-utf-8-right.

gaurav commented 12 years ago

Vizzuality have confirmed that it's a bug on their end (issue: https://github.com/Vizzuality/cartodb/issues/328). They've already implemented and deployed a fix which transliterates non-ASCII characters (e.g. "®" becomes "(R)"). This isn't good enough for Map of Life 1.0, but might be good enough for us right now. I've asked them when this bug might be completely eliminated; if they plan to make that happen before Jan 2012, it might make sense to just wait until they fix this. If not, we can discuss working on it ourselves.

gaurav commented 12 years ago

Good news: we'll probably be able to bypass this by using the SQL API. Bad news: the Oauth2 module we're currently using doesn't support UTF-8 (at least for GET arguments). Hopefully once we get POST working (issue #13), this will Just Work. Otherwise, we'll have to fork and manage the python-oauth2 module, and nobody wants that.

eightysteele commented 12 years ago

the Oauth2 module we're currently using doesn't support UTF-8

Which module are we using? The CartoDB Python client or something else?

gaurav commented 12 years ago

Oops, sorry, I thought I'd replied here already (I'd actually replied only to issue #12). Here's my answer from there:

CartoDB (like everybody else) uses https://github.com/simplegeo/python-oauth2, which hasn't been updated since May, and which has a huge number of people submitting fixes to what looks like exactly the UTF-8 bug affecting us. Unfortunately, nobody seems to have properly forked python-oauth2 (every fork I looked at used the same namespace, 'oauth2', even where it changes the API for this module, and every one used the same version number, for bonus confusion points).

Most other developers (e.g. Google) appear to just be incorporating the module into their codebase, then modifying it for their own use (and to get rid of bugs like the UTF-8 issue). Is there an easy way to insert the OAuth codebase into cartodb-python? Otherwise, we'll just have to bite the bullet and (properly) fork python-oauth2 into github.com/mapoflife.

I'm planning to spend a bit of time in the next two days (term paper permitting) trying to work around this without forking it entirely, and to see if I can develop the smallest possible test case that reproduces the bug. It's also possible that this bug will vanish once we get POST working, since it may be that only GET arguments are getting un-UTF-8'ed. So I'm not going to stress too much about this until I come up with a test case.

gaurav commented 12 years ago

I just rechecked this. At the moment, cartodb-python uses urllib.urlencode to convert the content into an application/x-www-form-urlencoded POST body, and urllib.urlencode can't handle Unicode strings.
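For reference, a rough sketch of what a UTF-8-safe form body looks like, written for Python 3, where `urllib.parse.urlencode` percent-encodes the UTF-8 bytes of each value. This is not the cartodb-python code, and the `title` field name is hypothetical. Python 2's `urllib.urlencode` expects byte strings, so the usual workaround there was to encode each value to UTF-8 before passing it in.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field; only the title text comes from this issue.
params = {"title": "IUCN Red List\u2122 of Threatened Species"}

# Each value is encoded as UTF-8 and then percent-encoded, so the
# resulting body contains only ASCII characters.
body = urlencode(params, encoding="utf-8")

# A server that decodes the body as UTF-8 recovers the original text.
decoded = parse_qs(body, encoding="utf-8")["title"][0]
assert decoded == params["title"]
```

Here the trademark sign travels as `%E2%84%A2`, which is its three-byte UTF-8 encoding.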

Then I accidentally fixed this by using unicode_str.encode('ascii', 'xmlcharrefreplace'), which converts each non-ASCII character into its decimal numeric character reference (e.g. \u1234 becomes &#4660;). Much to my surprise, CartoDB converts those back into Unicode when storing it! So we've got some sort of Unicode upload working as of @5f95961!
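A small sketch of the `xmlcharrefreplace` round trip, written for Python 3. The decoding step uses `html.unescape` purely as a stand-in; what CartoDB actually does on its side to turn the references back into Unicode is not known from this thread.

```python
import html

title = "IUCN Red List\u2122 of Threatened Species"

# Non-ASCII characters become decimal numeric character references,
# so the result is pure ASCII and safe to pass through urlencode.
ascii_bytes = title.encode("ascii", "xmlcharrefreplace")

# Stand-in for the receiving side: turn the references back into text.
restored = html.unescape(ascii_bytes.decode("ascii"))
assert restored == title
```

The trademark sign (U+2122) is sent as `&#8482;`. One caveat with this approach: text that already contains something that looks like a character reference (a literal `&#38;`, say) would also be converted on the receiving end, so the encoding is not lossless for arbitrary input.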

I'm not sure whether to close this issue or leave it open until we have a bit of time to test this separately. Any suggestions?