OpenSextant / OpenSextantToolbox

A geotagger and entity extractor

unicode input #13

Closed johnrfrank closed 9 years ago

johnrfrank commented 9 years ago

This is a toy example of posting UTF-8 encoded non-ASCII text to OpenSextant. As the output at the bottom shows, the characters \u00e7 and \u00e9 each come back as a pair of \ufffd (the Unicode replacement character), meaning the bytes were not decoded correctly. Why? Is this something simple?

Toy script to generate a Unicode POST:

    import requests
    import json
    import urllib

    data = u"Fran\u00e7oise is in the city of Qu\u00e9bec.".encode('utf8')
    resp = requests.post("http://localhost:8182/opensextant/extract/general/json", data=data)

    data = json.loads(resp.content)
    print json.dumps(data, indent=4, sort_keys=True)
    print repr(data['content'])

Now run it:

    $ python t.py
    {
        "annoList": [
            {
                "end": 4,
                "features": {
                    "EntityType": "PersonName",
                    "hierarchy": "Person.name.personName",
                    "isEntity": true,
                    "string": "Fran"
                },
                "matchText": "Fran",
                "start": 0,
                "type": "PersonName"
            },
            {
                "end": 25,
                "features": {
                    "EntityType": "FeatureType",
                    "hierarchy": "Geo.featureType.PopulatedPlace",
                    "isEntity": true,
                    "string": "the city"
                },
                "matchText": "the city",
                "start": 17,
                "type": "FeatureType"
            }
        ],
        "content": "Fran\ufffd\ufffdoise is in the city of Qu\ufffd\ufffdbec."
    }
    u'Fran\ufffd\ufffdoise is in the city of Qu\ufffd\ufffdbec.'
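One plausible reading of the doubled \ufffd, offered as an assumption since the thread does not say how the server decodes the request body: \u00e7 and \u00e9 each occupy two bytes in UTF-8, and a decoder that treats the body as ASCII-like replaces every invalid byte with one replacement character. The sketch below (Python 3, standard library only) reproduces that pattern locally without contacting OpenSextant.

```python
# Sketch only: simulates a receiver that decodes UTF-8 bytes with the wrong
# charset. That the OpenSextant server does exactly this is an assumption.
text = "Fran\u00e7oise is in the city of Qu\u00e9bec."

raw = text.encode("utf-8")  # \u00e7 -> b'\xc3\xa7', \u00e9 -> b'\xc3\xa9'

# Decoding as ASCII with replacement turns each non-ASCII byte into U+FFFD,
# so every two-byte character becomes two replacement characters.
garbled = raw.decode("ascii", errors="replace")

# Decoding with the charset that was actually used round-trips cleanly.
restored = raw.decode("utf-8")

print(repr(garbled))
print(restored == text)
```

The garbled string has the same shape as the "content" field returned above, which is consistent with the body being decoded under a charset other than UTF-8.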

johnrfrank commented 9 years ago

I fixed this: the request needed to declare UTF-8 in the HTTP POST headers. See https://github.com/streamcorpus/streamcorpus-opensextant/commit/5fa4083c680b635cf44dfdf1298f61598c5bfb87
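For readers who do not want to open the commit, here is a minimal sketch of what declaring UTF-8 in the POST headers can look like with requests (Python 3). The exact header value used in the linked commit is not quoted in this thread, so `text/plain; charset=utf-8` is an assumption, and `build_post_kwargs` and `post_text` are hypothetical helper names introduced only for illustration.

```python
def build_post_kwargs(text):
    """Return keyword arguments for requests.post with an explicit charset.

    The Content-Type value is an assumption; the thread only says that
    UTF-8 information had to be passed in the headers.
    """
    return {
        "data": text.encode("utf-8"),
        "headers": {"Content-Type": "text/plain; charset=utf-8"},
    }


def post_text(url, text):
    """Send the text to the given URL (hypothetical helper)."""
    # Imported here so build_post_kwargs can be used without requests installed.
    import requests

    return requests.post(url, **build_post_kwargs(text))


kwargs = build_post_kwargs("Fran\u00e7oise is in the city of Qu\u00e9bec.")
print(kwargs["headers"])
```

With a local server running, `post_text("http://localhost:8182/opensextant/extract/general/json", text)` would send the same body as the toy script, plus the charset declaration.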