GeoNode / django-osgeo-importer

An extensible Django app that allows users to upload geospatial data.
GNU General Public License v3.0
26 stars 30 forks source link

Unicode error #55

Open jj0hns0n opened 7 years ago

jj0hns0n commented 7 years ago

In django-osgeo-importer, visit /uploads/new and click Choose File. Select the file linked below. It should continue to the next step of the upload process but halts with a UnicodeDecodeError.

https://eventkit.s3.amazonaws.com/0e473d86-6953-47d0-8dda-721b94039e1d/osm-vector/portauprince.gpkg

ghost commented 7 years ago

For supplementary info (supplemental info?) here is the traceback I got locally against Exchange.

Environment:

Request Method: POST
Request URL: http://exchange/uploads/new

Django Version: 1.8.16
Python Version: 2.7.11
Installed Applications:
('appearance',
 'flat',
 'exchange.core',
 'geonode',
 'geonode.contrib.geogig',
 'geonode.contrib.slack',
 'django_classification_banner',
 'maploom',
 'solo',
 'haystack',
 'corsheaders',
 'exchange-docs',
 'osgeo_importer',
 'modeltranslation',
 'django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.sites',
 'django.contrib.admin',
 'django.contrib.sitemaps',
 'django.contrib.staticfiles',
 'django.contrib.messages',
 'django.contrib.humanize',
 'django.contrib.gis',
 'pagination',
 'taggit',
 'treebeard',
 'friendlytagloader',
 'geoexplorer',
 'leaflet',
 'django_extensions',
 'autocomplete_light',
 'mptt',
 'djcelery',
 'storages',
 'pinax_theme_bootstrap_account',
 'pinax_theme_bootstrap',
 'django_forms_bootstrap',
 'account',
 'avatar',
 'dialogos',
 'agon_ratings',
 'announcements',
 'actstream',
 'user_messages',
 'tastypie',
 'polymorphic',
 'guardian',
 'geonode.people',
 'geonode.base',
 'geonode.layers',
 'geonode.maps',
 'geonode.proxy',
 'geonode.security',
 'geonode.social',
 'geonode.catalogue',
 'geonode.documents',
 'geonode.api',
 'geonode.groups',
 'geonode.services',
 'geonode.geoserver',
 'geonode.upload',
 'geonode.tasks')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'django.middleware.locale.LocaleMiddleware',
 'pagination.middleware.PaginationMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.middleware.clickjacking.XFrameOptionsMiddleware',
 'geonode.security.middleware.LoginRequiredMiddleware',
 'geonode.security.middleware.LoginRequiredMiddleware',
 'corsheaders.middleware.CorsMiddleware',
 'whitenoise.middleware.WhiteNoiseMiddleware')

Traceback:
File "/env/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
  132.                     response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/env/lib/python2.7/site-packages/django/contrib/auth/decorators.py" in _wrapped_view
  22.                 return view_func(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/base.py" in view
  71.             return self.dispatch(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/base.py" in dispatch
  89.         return handler(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/edit.py" in post
  215.             return self.form_valid(form)
File "/mnt/importer/osgeo_importer/views.py" in form_valid
  212.                     upload.uploadlayer_set.add(upload_layer)
File "/env/lib/python2.7/site-packages/django/db/models/fields/related.py" in add
  750.                     obj.save()
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in save
  734.                        force_update=force_update, update_fields=update_fields)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in save_base
  762.             updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in _save_table
  846.             result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in _do_insert
  885.                                using=using, raw=raw)
File "/env/lib/python2.7/site-packages/django/db/models/manager.py" in manager_method
  127.                 return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/env/lib/python2.7/site-packages/django/db/models/query.py" in _insert
  920.         return query.get_compiler(using=using).execute_sql(return_id)
File "/env/lib/python2.7/site-packages/django/db/models/sql/compiler.py" in execute_sql
  973.             for sql, params in self.as_sql():
File "/env/lib/python2.7/site-packages/django/db/models/sql/compiler.py" in as_sql
  931.                 for obj in self.query.objs
File "/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py" in get_db_prep_save
  710.                                       prepared=False)
File "/env/lib/python2.7/site-packages/jsonfield/fields.py" in get_db_prep_value
  89.         return self.get_prep_value(value)
File "/env/lib/python2.7/site-packages/jsonfield/fields.py" in get_prep_value
  96.         return json.dumps(value, **self.encoder_kwargs)
File "/usr/local/lib/python2.7/json/__init__.py" in dumps
  244.         return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py" in encode
  207.         chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py" in iterencode
  270.         return _iterencode(o, 0)

Exception Type: UnicodeDecodeError at /uploads/new
Exception Value: 'utf8' codec can't decode byte 0xa5 in position 16: invalid start byte
ghost commented 7 years ago

Our line of interest is probably osgeo_importer/views.py in form_valid: upload.uploadlayer_set.add(upload_layer).

Django hints The string that could not be encoded/decoded was: umber�,z.

This substring is in layer['fields'] list / description['fields'] list:

...
{'name': 'addr_housenumber\xa5,z\x7f', 'type': 'IntegerList'},
             {'name': 'addr_streetumber\xa5,z\x7f', 'type': 'Real'},
...

To be specific, the traceback says 'utf8' codec can't decode byte 0xa5 in position 16: invalid start byte - so we know that we can't just decode this blob of bytes as utf-8. What, then, is the intended encoding?

While it isn't a correct or robust way for us to handle byte blobs we receive, chardet can give us a clue what these particular bytes were intended to be:

Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> a = 'addr_housenumber\xa5,z\x7f'
>>> chardet.detect(a)
{'confidence': 0.8406238024693018, 'encoding': 'ISO-8859-2'}

That's a little funny, because Wikipedia gives the following list of languages as likely to use this character set:


    Albanian
    Bosnian
    Croatian
    Czech
    German (fully compatible with ISO/IEC 8859-1 for German texts)
    Hungarian
    Polish
    Romanian
    Serbian Latin
    Slovak
    Slovene
    Upper Sorbian
    Lower Sorbian
    Turkmen

Why can't we just use chardet? (1) some bytestrings are just nonsensical, they don't really fit in any character set. (2) Sometimes the guess is wrong, and then we will be making nonsense characters (aka mojibake).

Can we just delete the wrong characters? Well, what if there are none which aren't wrong - then do we give the field a default name? In any case, this means destroying input data. But it does seem safer to replace the whole string than to unpredictably try to salvage something from what may be totally corrupted nonsense.

The right solution would be, if the gpkg provides a clue about what encoding field names will be in, and we get GDAL to use that clue, or at worst give that clue to us so we can fix things.

If on the other hand gpkg is being abused by certain files we can find in the wild, we can feel more justified in "losing data" because the data was already lost due to someone else's mistake.

If we really have no options, then the best thing to do would be to treat field names only as bytestrings and never as characters (because the moment they are treated as characters, this encoding issue will arise again).

To sum up, in order of preference

  1. The gpkg parser handles the encoding of field names for us
  2. The gpkg parser hands us the encoding so we can decode field names
  3. We can document that this file is abusing gpkg format as far as its encoding of field names and/or its identification of their encoding, and cite this as a reason to use some nondescript default field name instead, and spit out some descriptive warning that users will notice, and document that this may happen so at least users find something when they google the issue
  4. we adopt a strict discipline of only treating field names as mere arbitrary bytestrings, never as human-readable text which must be decoded for human consumption or passed to any python callables which interpret them as human-readable txt
ghost commented 7 years ago

We could also just explicitly and verbosely barf on the whole file and reject it for the user to worry about, that's another option.

JivanAmara commented 7 years ago

This is looking like a gpkg creation error rather than encoding confusion. I vote for rejecting the file with some actionable details for the user.

ingenieroariel commented 7 years ago

In this conversation about rejected uploads I'd like to chime in and share some of my experiences after talking to GeoNode users.

In Indonesia, after trying to upload a 500mb raster file for days and getting timeouts a user told me the file was finally uploaded and then rejected because it did not come with a valid projection file.

That led me to think that in many cases the user is not able to resolve the problems themselves and will need to ask for help from the system administrator, in those cases we can help them by doing the following:

  1. Having a basic workflow in which users upload spatial files and we let it through but do not have it active yet.
  2. The end user can share that partial upload with other users or admins until they figure out a way to enable the layer (without a reupload when possible)
  3. after all conflicts have been fixed it becomes active like a regular layer.

I am always a fan of always enabling layers that are fully functional (can automatically be turned into a slippy map), but GeoNode users seem to want GeoNode to also help them with the non individual curation aspect.

Is there a way to reject files without requiring a re-upload at a later time?

On Thu, Dec 1, 2016 at 8:57 PM, Jivan Amara notifications@github.com wrote:

This is looking like a gpkg creation error rather than encoding confusion. I vote for rejecting the file with some actionable details for the user.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/GeoNode/django-osgeo-importer/issues/55#issuecomment-264352711, or mute the thread https://github.com/notifications/unsubscribe-auth/AADW15y6FQyLk5jFQiOEqGj4dY7aTgwLks5rD3sngaJpZM4Kmckl .