Open jj0hns0n opened 8 years ago
For supplementary info (supplemental info?) here is the traceback I got locally against Exchange.
Environment:
Request Method: POST
Request URL: http://exchange/uploads/new
Django Version: 1.8.16
Python Version: 2.7.11
Installed Applications:
('appearance',
'flat',
'exchange.core',
'geonode',
'geonode.contrib.geogig',
'geonode.contrib.slack',
'django_classification_banner',
'maploom',
'solo',
'haystack',
'corsheaders',
'exchange-docs',
'osgeo_importer',
'modeltranslation',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.sites',
'django.contrib.admin',
'django.contrib.sitemaps',
'django.contrib.staticfiles',
'django.contrib.messages',
'django.contrib.humanize',
'django.contrib.gis',
'pagination',
'taggit',
'treebeard',
'friendlytagloader',
'geoexplorer',
'leaflet',
'django_extensions',
'autocomplete_light',
'mptt',
'djcelery',
'storages',
'pinax_theme_bootstrap_account',
'pinax_theme_bootstrap',
'django_forms_bootstrap',
'account',
'avatar',
'dialogos',
'agon_ratings',
'announcements',
'actstream',
'user_messages',
'tastypie',
'polymorphic',
'guardian',
'geonode.people',
'geonode.base',
'geonode.layers',
'geonode.maps',
'geonode.proxy',
'geonode.security',
'geonode.social',
'geonode.catalogue',
'geonode.documents',
'geonode.api',
'geonode.groups',
'geonode.services',
'geonode.geoserver',
'geonode.upload',
'geonode.tasks')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'django.middleware.locale.LocaleMiddleware',
'pagination.middleware.PaginationMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.middleware.clickjacking.XFrameOptionsMiddleware',
'geonode.security.middleware.LoginRequiredMiddleware',
'geonode.security.middleware.LoginRequiredMiddleware',
'corsheaders.middleware.CorsMiddleware',
'whitenoise.middleware.WhiteNoiseMiddleware')
Traceback:
File "/env/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
132. response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/env/lib/python2.7/site-packages/django/contrib/auth/decorators.py" in _wrapped_view
22. return view_func(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/base.py" in view
71. return self.dispatch(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/base.py" in dispatch
89. return handler(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/edit.py" in post
215. return self.form_valid(form)
File "/mnt/importer/osgeo_importer/views.py" in form_valid
212. upload.uploadlayer_set.add(upload_layer)
File "/env/lib/python2.7/site-packages/django/db/models/fields/related.py" in add
750. obj.save()
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in save
734. force_update=force_update, update_fields=update_fields)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in save_base
762. updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in _save_table
846. result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in _do_insert
885. using=using, raw=raw)
File "/env/lib/python2.7/site-packages/django/db/models/manager.py" in manager_method
127. return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/env/lib/python2.7/site-packages/django/db/models/query.py" in _insert
920. return query.get_compiler(using=using).execute_sql(return_id)
File "/env/lib/python2.7/site-packages/django/db/models/sql/compiler.py" in execute_sql
973. for sql, params in self.as_sql():
File "/env/lib/python2.7/site-packages/django/db/models/sql/compiler.py" in as_sql
931. for obj in self.query.objs
File "/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py" in get_db_prep_save
710. prepared=False)
File "/env/lib/python2.7/site-packages/jsonfield/fields.py" in get_db_prep_value
89. return self.get_prep_value(value)
File "/env/lib/python2.7/site-packages/jsonfield/fields.py" in get_prep_value
96. return json.dumps(value, **self.encoder_kwargs)
File "/usr/local/lib/python2.7/json/__init__.py" in dumps
244. return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py" in encode
207. chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py" in iterencode
270. return _iterencode(o, 0)
Exception Type: UnicodeDecodeError at /uploads/new
Exception Value: 'utf8' codec can't decode byte 0xa5 in position 16: invalid start byte
Our line of interest is probably osgeo_importer/views.py
in form_valid
: upload.uploadlayer_set.add(upload_layer)
.
Django hints The string that could not be encoded/decoded was: umber�,z
.
This substring is in layer['fields']
list / description['fields']
list:
...
{'name': 'addr_housenumber\xa5,z\x7f', 'type': 'IntegerList'},
{'name': 'addr_streetumber\xa5,z\x7f', 'type': 'Real'},
...
To be specific, the traceback says 'utf8' codec can't decode byte 0xa5 in position 16: invalid start byte
- so we know that we can't just decode this blob of bytes as utf-8. What, then, is the intended encoding?
While it isn't a correct or robust way for us to handle byte blobs we receive, chardet
can give us a clue what these particular bytes were intended to be:
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> a = 'addr_housenumber\xa5,z\x7f'
>>> chardet.detect(a)
{'confidence': 0.8406238024693018, 'encoding': 'ISO-8859-2'}
That's a little funny, because Wikipedia gives the following list of languages as likely to use this character set:
Albanian
Bosnian
Croatian
Czech
German (fully compatible with ISO/IEC 8859-1 for German texts)
Hungarian
Polish
Romanian
Serbian Latin
Slovak
Slovene
Upper Sorbian
Lower Sorbian
Turkmen
Why can't we just use chardet
? (1) some bytestrings are just nonsensical, they don't really fit in any character set. (2) Sometimes the guess is wrong, and then we will be making nonsense characters (aka mojibake).
Can we just delete the wrong characters? Well, what if there are none which aren't wrong - then do we give the field a default name? In any case, this means destroying input data. But it does seem safer to replace the whole string than to unpredictably try to salvage something from what may be totally corrupted nonsense.
The right solution would be, if the gpkg provides a clue about what encoding field names will be in, and we get GDAL to use that clue, or at worst give that clue to us so we can fix things.
If on the other hand gpkg is being abused by certain files we can find in the wild, we can feel more justified in "losing data" because the data was already lost due to someone else's mistake.
If we really have no options, then the best thing to do would be to treat field names only as bytestrings and never as characters (because the moment they are treated as characters, this encoding issue will arise again).
To sum up, in order of preference
We could also just explicitly and verbosely barf on the whole file and reject it for the user to worry about, that's another option.
This is looking like a gpkg creation error rather than encoding confusion. I vote for rejecting the file with some actionable details for the user.
In this conversation about rejected uploads I'd like to chime in and share some of my experiences after talking to GeoNode users.
In Indonesia, after trying to upload a 500mb raster file for days and getting timeouts a user told me the file was finally uploaded and then rejected because it did not come with a valid projection file.
That led me to think that in many cases the user is not able to resolve the problems themselves and will need to ask for help from the system administrator, in those cases we can help them by doing the following:
I am always a fan of always enabling layers that are fully functional (can automatically be turned into a slippy map), but GeoNode users seem to want GeoNode to also help them with the non individual curation aspect.
Is there a way to reject files without requiring a re-upload at a later time?
On Thu, Dec 1, 2016 at 8:57 PM, Jivan Amara notifications@github.com wrote:
This is looking like a gpkg creation error rather than encoding confusion. I vote for rejecting the file with some actionable details for the user.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/GeoNode/django-osgeo-importer/issues/55#issuecomment-264352711, or mute the thread https://github.com/notifications/unsubscribe-auth/AADW15y6FQyLk5jFQiOEqGj4dY7aTgwLks5rD3sngaJpZM4Kmckl .
In django-osgeo-importer, visit /uploads/new and click Choose File. Select the file linked below. It should continue to the next step of the upload process but halts with a UnicodeDecodeError.
https://eventkit.s3.amazonaws.com/0e473d86-6953-47d0-8dda-721b94039e1d/osm-vector/portauprince.gpkg