Status: Closed (johnrfrank closed this issue 9 years ago)
This is a toy example of posting UTF-8 encoded non-ASCII text to OpenSextant. As the output at the bottom shows, the characters \u00e7 and \u00e9 each come back as a pair of \ufffd replacement characters, meaning the text was mangled somewhere along the way. Why? Is this something simple?
Toy script to generate a Unicode POST:

    import requests
    import json
    import urllib

    data = u"Fran\u00e7oise is in the city of Qu\u00e9bec.".encode('utf8')
    resp = requests.post("http://localhost:8182/opensextant/extract/general/json", data=data)

    data = json.loads(resp.content)
    print json.dumps(data, indent=4, sort_keys=True)
    print repr(data['content'])
Now run it:

    $ python t.py
    {
        "annoList": [
            {
                "end": 4,
                "features": {
                    "EntityType": "PersonName",
                    "hierarchy": "Person.name.personName",
                    "isEntity": true,
                    "string": "Fran"
                },
                "matchText": "Fran",
                "start": 0,
                "type": "PersonName"
            },
            {
                "end": 25,
                "features": {
                    "EntityType": "FeatureType",
                    "hierarchy": "Geo.featureType.PopulatedPlace",
                    "isEntity": true,
                    "string": "the city"
                },
                "matchText": "the city",
                "start": 17,
                "type": "FeatureType"
            }
        ],
        "content": "Fran\ufffd\ufffdoise is in the city of Qu\ufffd\ufffdbec."
    }
    u'Fran\ufffd\ufffdoise is in the city of Qu\ufffd\ufffdbec.'
I fixed this. I needed to declare the UTF-8 charset in the HTTP POST headers: https://github.com/streamcorpus/streamcorpus-opensextant/commit/5fa4083c680b635cf44dfdf1298f61598c5bfb87
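For anyone hitting the same symptom, here is a minimal sketch of the idea, not the code from the linked commit. It assumes the server falls back to an ASCII-like decoding when no charset is declared (the server side is not shown in this thread), and it assumes `text/plain` as the media type. The helper names `build_request` and `simulate_server_decode` are made up for illustration. The request is only built, never sent, so the snippet runs without OpenSextant.

```python
import urllib.request

# URL taken from the toy script in this issue.
URL = "http://localhost:8182/opensextant/extract/general/json"
TEXT = "Fran\u00e7oise is in the city of Qu\u00e9bec."


def build_request(text, charset="utf-8"):
    """Build (but do not send) a POST whose Content-Type names the charset."""
    body = text.encode(charset)
    # Assumption: text/plain is an acceptable media type for this endpoint.
    headers = {"Content-Type": "text/plain; charset=" + charset}
    return urllib.request.Request(URL, data=body, headers=headers, method="POST")


def simulate_server_decode(body, assumed_charset):
    """Hypothetical stand-in for a server decoding the body with some charset."""
    return body.decode(assumed_charset, errors="replace")


req = build_request(TEXT)

# Without charset information, a server that guesses ASCII turns each byte
# of a two-byte UTF-8 sequence into one U+FFFD replacement character.
broken = simulate_server_decode(req.data, "ascii")

# With the declared charset, the text survives the round trip.
fixed = simulate_server_decode(req.data, "utf-8")

print(repr(broken))
print(repr(fixed))
```

Under these assumptions, `broken` shows the same two-replacement-characters-per-letter pattern as the output in the report, and `fixed` equals the original text. With `requests`, the same header can be passed through the `headers=` keyword argument of `requests.post`.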