ckan / datapusher

A standalone web service that pushes data files from a CKAN site resources into its DataStore
GNU Affero General Public License v3.0

UnicodeEncodeError: 'ascii' codec can't encode.... (includes test dataset) #93

Open antitoxic opened 8 years ago

antitoxic commented 8 years ago

@amercader we (the civic hackers pushing the Bulgarian opendata portal) are closely following your work. Thank you for the datapusher and the other tools you're working on.

We have a show-stopper problem with the datapusher, and it seems to be quite common. It could be the data: it might be strangely formatted, or it might simply be because it's Cyrillic. Please give us a hint.

This is the dataset where the error occurs: http://opendata.obshtestvo.bg/dataset/spisak-na-razprostranitelite-ne-vinetni-stikeri

This is our staging server. You can play around and not worry about data or crashing.

This is what we get in the DataStore tab:

Error: CKAN DataStore bad response. Status code: 500 Internal Server Error. At: http://opendata.obshtestvo.bg/api/3/action/datastore_create. 
HTTP status code: 500 
Response: <html> <head> <title>Server Error</title> </head> <body> <h1>Server Error</h1> An internal server error occurred </body> </html> 
Requested URL: http://opendata.obshtestvo.bg/api/3/action/datastore_create

with:

Fetching from: http://opendata.obshtestvo.bg/dataset/08b276a7-915f-43aa-babc-9f0a9a4f7fc0/resource/642392b2-9b96-42bd-9f16-e9bce274c308/download/spisak-na-razprostranitelite-na-vinetni-stikeri20150630.csv

Btw, what is "Determined headers and types"? It's also giving us this:

level: INFO
timestamp: 2016-01-28T21:38:43.591342
module: jobs
funcName: push_to_datastore
lineno: 377
message:
Determined headers and types: [{'type': u'text', 'id': u'690,"\u0411\u041f 6007 \u041a\u044a\u0440\u0434\u0436\u0430\u043b\u0438 7","\u041a\u044a\u0440\u0434\u0436\u0430\u043b\u0438","\u041a\u044a\u0440\u0434\u0436\u0430\u043b\u0438","\u041a\u044a\u0440\u0434\u0436\u0430\u043b\u0438 7 \u043a\u0432. ""\u0412\u044a\u0437\u0440\u043e\u0436\u0434\u0435\u043d\u0446\u0438"" \u0431\u043b. 2","0361/65582 ","8-12;"13-16.30""'}, {'type': u'text', 'id': u'\u043d\u0435 \u0440\u0430\u0431\u043e\u0442\u0438""'}, {'type': u'text', 'id': u'\u043d\u0435 \u0440\u0430\u0431\u043e\u0442\u0438""";'}]

The unescaped version of the "Determined headers and types" is:

[{'id': '690,"БП 6007 Кърджали 7","Кърджали","Кърджали","Кърджали 7 кв. ""Възрожденци"" бл. 2","0361/65582 ","8-12;"13-16.30""', 'type': 'text'}, {'id': 'не работи""', 'type': 'text'}, {'id': 'не работи""";', 'type': 'text'}]

It seems the CSV is encoded in Windows-1251. Is this the problem? Interestingly, the data in "Determined headers and types" is actually line number 693 of the file. Why is this line detected as the headers?
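
If the file really is Windows-1251, one thing worth trying is re-encoding it to UTF-8 before uploading. The following is only a minimal sketch under that assumption: `cp1251` is Python's codec name for Windows-1251, while the function name `transcode_csv` and the file paths are hypothetical and not part of datapusher. It does not address why line 693 was picked as the header row.

```python
import io


def transcode_csv(src_path, dst_path, src_encoding="cp1251"):
    """Rewrite a text file as UTF-8, assuming it is encoded in src_encoding.

    Hypothetical helper for illustration; not part of datapusher or CKAN.
    """
    # newline="" leaves the original line endings untouched on read and write.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Usage would be something like `transcode_csv("stickers-cp1251.csv", "stickers-utf8.csv")`, followed by uploading the second file as the resource. If the guess about the source encoding is wrong, the output will be readable UTF-8 but contain the wrong characters, so it is worth opening the result before uploading.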

This is the stack:

URL: http://opendata.obshtestvo.bg/api/3/action/datastore_create
File '/usr/lib/ckan/default/lib/python2.7/site-packages/weberror/errormiddleware.py', line 171 in __call__
  app_iter = self.application(environ, sr_checker)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/dec.py', line 147 in __call__
  resp = self.call_func(req, *args, **self.kwargs)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/dec.py', line 208 in call_func
  return self.func(req, *args, **kwargs)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/fanstatic/publisher.py', line 234 in __call__
  return request.get_response(self.app)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/request.py', line 1053 in get_response
  application, catch_exc_info=False)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/request.py', line 1022 in call_application
  app_iter = application(self.environ, start_response)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/dec.py', line 147 in __call__
  resp = self.call_func(req, *args, **self.kwargs)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/dec.py', line 208 in call_func
  return self.func(req, *args, **kwargs)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/fanstatic/injector.py', line 54 in __call__
  response = request.get_response(self.app)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/request.py', line 1053 in get_response
  application, catch_exc_info=False)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/webob/request.py', line 1022 in call_application
  app_iter = application(self.environ, start_response)
File '/usr/lib/ckan/default/src/ckan/ckan/config/middleware.py', line 389 in inner
  result = application(environ, start_response)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/beaker/middleware.py', line 73 in __call__
  return self.app(environ, start_response)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/beaker/middleware.py', line 155 in __call__
  return self.wrap_app(environ, session_start_response)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/routes/middleware.py', line 131 in __call__
  response = self.app(environ, start_response)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/pylons/wsgiapp.py', line 125 in __call__
  response = self.dispatch(controller, environ, start_response)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/pylons/wsgiapp.py', line 324 in dispatch
  return controller(environ, start_response)
File '/usr/lib/ckan/default/src/ckan/ckan/controllers/api.py', line 70 in __call__
  return base.BaseController.__call__(self, environ, start_response)
File '/usr/lib/ckan/default/src/ckan/ckan/lib/base.py', line 337 in __call__
  res = WSGIController.__call__(self, environ, start_response)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/pylons/controllers/core.py', line 221 in __call__
  response = self._dispatch_call()
File '/usr/lib/ckan/default/lib/python2.7/site-packages/pylons/controllers/core.py', line 172 in _dispatch_call
  response = self._inspect_call(func)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/pylons/controllers/core.py', line 107 in _inspect_call
  result = self._perform_call(func, args)
File '/usr/lib/ckan/default/lib/python2.7/site-packages/pylons/controllers/core.py', line 60 in _perform_call
  return func(**args)
File '/usr/lib/ckan/default/src/ckan/ckan/controllers/api.py', line 205 in action
  result = function(context, request_data)
File '/usr/lib/ckan/default/src/ckan/ckan/logic/__init__.py', line 416 in wrapped
  result = _action(context, data_dict, **kw)
File '/usr/lib/ckan/default/src/ckan/ckanext/datastore/logic/action.py', line 141 in datastore_create
  result = db.create(context, data_dict)
File '/usr/lib/ckan/default/src/ckan/ckanext/datastore/db.py', line 1071 in create
  create_table(context, data_dict)
File '/usr/lib/ckan/default/src/ckan/ckanext/datastore/db.py', line 306 in create_table
  check_fields(context, supplied_fields)
File '/usr/lib/ckan/default/src/ckan/ckanext/datastore/db.py', line 273 in check_fields
  field['id'])]
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-6: ordinal not in range(128)

CGI Variables
-------------
  CKAN_CURRENT_URL: '/api/3/action/datastore_create'
  CKAN_LANG: 'en'
  CKAN_LANG_IS_DEFAULT: True
  CONTENT_LENGTH: '242477'
  CONTENT_TYPE: 'application/json; charset=utf-8'
  DOCUMENT_ROOT: '/var/www'
  GATEWAY_INTERFACE: 'CGI/1.1'
  HTTP_ACCEPT: '*/*'
  HTTP_ACCEPT_ENCODING: 'gzip, deflate'
  HTTP_AUTHORIZATION: 'e2067dcf-c6d5-4c3b-9ee9-f61729daf378'
  HTTP_CONNECTION: 'close'
  HTTP_HOST: 'opendata.obshtestvo.bg'
  HTTP_USER_AGENT: 'python-requests/2.7.0 CPython/2.7.3 Linux/3.16.0-4-amd64'
  HTTP_X_FORWARDED_FOR: '10.255.0.1'
  PATH_INFO: '/api/3/action/datastore_create'
  PATH_TRANSLATED: '/etc/ckan/default/opendatabulgaria.wsgi/api/3/action/datastore_create'
  REMOTE_ADDR: '10.255.0.1'
  REMOTE_PORT: '58505'
  REQUEST_METHOD: 'POST'
  REQUEST_URI: '/api/3/action/datastore_create'
  SCRIPT_FILENAME: '/etc/ckan/default/opendatabulgaria.wsgi'
  SERVER_ADDR: '127.0.0.1'
  SERVER_ADMIN: '[no address given]'
  SERVER_NAME: 'opendata.obshtestvo.bg'
  SERVER_PORT: '80'
  SERVER_PROTOCOL: 'HTTP/1.0'
  SERVER_SIGNATURE: '<address>Apache/2.2.22 (Debian) Server at opendata.obshtestvo.bg Port 80</address>\n'
  SERVER_SOFTWARE: 'Apache/2.2.22 (Debian)'

WSGI Variables
--------------
  application: <fanstatic.publisher.Delegator object at 0x7ff7d6c59290>
  beaker.cache: <beaker.cache.CacheManager object at 0x7ff7d6c59310>
  beaker.get_session: <bound method SessionMiddleware._get_session of <beaker.middleware.SessionMiddleware object at 0x7ff7d6507990>>
  beaker.session: {'_accessed_time': 1454008286.581036, '_creation_time': 1454008286.581036}
  fanstatic.needed: <fanstatic.core.NeededResources object at 0x7ff7d88f99d0>
  mod_wsgi.application_group: 'opendata.obshtestvo.bg|'
  mod_wsgi.callable_object: 'application'
  mod_wsgi.handler_script: ''
  mod_wsgi.input_chunked: '0'
  mod_wsgi.listener_host: ''
  mod_wsgi.listener_port: '8080'
  mod_wsgi.process_group: 'opendatabulgaria'
  mod_wsgi.request_handler: 'wsgi-script'
  mod_wsgi.script_reloading: '1'
  mod_wsgi.version: (3, 3)
  paste.cookies: (<SimpleCookie: >, '')
  paste.registry: <paste.registry.Registry object at 0x7ff7d88e46d0>
  paste.throw_errors: True
  pylons.action_method: <bound method ApiController.action of <ckan.controllers.api.ApiController object at 0x7ff7d899f850>>
  pylons.controller: <ckan.controllers.api.ApiController object at 0x7ff7d899f850>
  pylons.environ_config: {'session': 'beaker.session', 'cache': 'beaker.cache'}
  pylons.pylons: <pylons.util.PylonsContext object at 0x7ff7d88ef250>
  pylons.routes_dict: {'action': u'action', 'controller': u'api', 'ver': 3, 'logic_function': u'datastore_create'}
  pylons.status_code_redirect: True
  repoze.who.api: <repoze.who.api.API object at 0x7ff7d88e4d90>
  repoze.who.logger: <logging.Logger object at 0x7ff7d6507610>
  repoze.who.plugins: {'ckan.lib.authenticator:UsernamePasswordAuthenticator': <ckan.lib.authenticator.UsernamePasswordAuthenticator object at 0x7ff7d6c59850>, 'friendlyform': <FriendlyFormPlugin 140702429166864>, 'auth_tkt': <CkanAuthTktCookiePlugin 140702429166416>}
  routes.route: <routes.route.Route object at 0x7ff7d695bb10>
  routes.url: <routes.util.URLGenerator object at 0x7ff7d88ef3d0>
  webob._parsed_query_vars: (GET([]), '')
  webob.adhoc_attrs: {'response': <Response at 0x7ff7d88ef050 200 OK>, 'language': 'en-us'}
  webob.is_body_seekable: True
  wsgi process: 'Multi process AND threads (?)'
  wsgi.file_wrapper: <built-in method file_wrapper of mod_wsgi.Adapter object at 0x7ff7d88e6468>
  wsgi.version: (1, 1)
  wsgiorg.routing_args: (<routes.util.URLGenerator object at 0x7ff7d88ef3d0>, {'action': u'action', 'controller': u'api', 'ver': 3, 'logic_function': u'datastore_create'})

Vanuan commented 6 years ago

Having the same issue. The CSV file is in UTF-8, so the file encoding shouldn't be the problem. I suspect the issue is that the headers are partly ASCII and partly UTF-8. Need to check.
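
One quick way to check that suspicion is to list which headers contain characters outside the ASCII range. This is just a sketch: `non_ascii_headers` is a hypothetical helper, and the sample headers are made up for illustration rather than taken from the actual file.

```python
def non_ascii_headers(headers):
    """Return the headers that contain at least one non-ASCII character."""
    return [h for h in headers if any(ord(ch) > 127 for ch in h)]


# Made-up sample headers: two plain ASCII ones and one Cyrillic one.
sample = [u"id", u"phone", u"не работи"]
mixed = non_ascii_headers(sample)
```

If the returned list is non-empty but shorter than the full header list, the headers are indeed a mix of ASCII-only and non-ASCII names.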

MasaGon commented 1 year ago

For Python 2, I guess you have to change str(header) to header.encode('utf-8').

-    headers = [str(header) for header in headers]
+    headers = [header.encode('utf-8') for header in headers]
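
To illustrate why the change above might help: in Python 2, calling `str()` on a `unicode` object implicitly encodes it with the default codec, which is ASCII unless it has been changed, and that fails on Cyrillic text. The sketch below makes that implicit step explicit with `.encode("ascii")` so it behaves the same on Python 2 and 3. The helper names `to_ascii` and `to_utf8` and the sample headers are hypothetical and are not taken from the datapusher source.

```python
# -*- coding: utf-8 -*-
headers = [u"id", u"не работи"]


def to_ascii(header):
    # Mirrors what Python 2's str(header) does implicitly for unicode input.
    return header.encode("ascii")


def to_utf8(header):
    # The replacement proposed in the diff above.
    return header.encode("utf-8")


try:
    [to_ascii(h) for h in headers]
    failed = False
except UnicodeEncodeError as exc:
    # Same exception type as the one at the bottom of the stack trace.
    failed = True
    print("ascii encoding failed: %s" % exc)

# Encoding to UTF-8 succeeds for both the ASCII and the Cyrillic header.
encoded = [to_utf8(h) for h in headers]
```

Note that this only shows the encoding step in isolation. Whether the patched headers are then accepted by `datastore_create` is a separate question that this sketch does not answer.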