NaturalHistoryMuseum / ckanext-versioned-datastore

A CKAN extension providing a versioned datastore using MongoDB and Elasticsearch.
GNU General Public License v3.0

There are still a few character encoding errors not caught during the prep stage #15

Closed by jrdh 5 years ago

jrdh commented 5 years ago

Examples from our staging server:

2019-08-30 15:44:36,452 INFO  [ckanext.versioned_datastore.lib.importing] Starting data import for afe28105-1633-4d93-b970-5c680a34c241 at version 1567175769966
2019-08-30 15:44:36,453 INFO  [ckanext.versioned_datastore.lib.ingestion.ingesting] Starting validation for afe28105-1633-4d93-b970-5c680a34c241
2019-08-30 15:44:36,755 INFO  [ckanext.versioned_datastore.lib.ingestion.ingesting] Prep failed for resource afe28105-1633-4d93-b970-5c680a34c241 due to UnicodeDecodeError: 'utf8' codec can't decode byte 0xa1 in position 2: invalid start byte
2019-08-30 15:44:36,756 ERROR [ckan.lib.jobs] Job 23f3833c-1a08-486f-818b-37dba327d7cc on worker rq:worker:data-ckan-stg-1.73486 raised an exception: 'utf8' codec can't decode byte 0xa1 in position 2: invalid start byte
Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
    rv = job.perform()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/job.py", line 498, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/ckanext/versioned_datastore/lib/importing.py", line 90, in import_resource_data
    request.replace, request.api_key)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/ckanext/versioned_datastore/lib/ingestion/ingesting.py", line 62, in ingest_resource
    raise e
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa1 in position 2: invalid start byte
2019-08-30 15:44:35,434 INFO  [ckanext.versioned_datastore.lib.importing] Starting data import for 6c22efb5-3d3b-40d9-94a6-95eaf8a01c7a at version 1567175769555
2019-08-30 15:44:35,434 INFO  [ckanext.versioned_datastore.lib.ingestion.ingesting] Starting validation for 6c22efb5-3d3b-40d9-94a6-95eaf8a01c7a
2019-08-30 15:44:35,764 INFO  [ckanext.versioned_datastore.lib.ingestion.ingesting] Prep failed for resource 6c22efb5-3d3b-40d9-94a6-95eaf8a01c7a due to Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
2019-08-30 15:44:35,765 ERROR [ckan.lib.jobs] Job 7c47add7-e98d-4d2d-ab19-c5bba9190bce on worker rq:worker:data-ckan-stg-1.73486 raised an exception: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
    rv = job.perform()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/job.py", line 498, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/ckanext/versioned_datastore/lib/importing.py", line 90, in import_resource_data
    request.replace, request.api_key)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/ckanext/versioned_datastore/lib/ingestion/ingesting.py", line 62, in ingest_resource
    raise e
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
2019-08-30 15:48:15,870 INFO  [ckanext.versioned_datastore.lib.importing] Starting data import for 07555c45-ed3f-4178-83a4-dfa0144e35d2 at version 1567175791432
2019-08-30 15:48:15,870 INFO  [ckanext.versioned_datastore.lib.ingestion.ingesting] Starting validation for 07555c45-ed3f-4178-83a4-dfa0144e35d2
2019-08-30 15:48:16,187 INFO  [ckanext.versioned_datastore.lib.ingestion.ingesting] Prep failed for resource 07555c45-ed3f-4178-83a4-dfa0144e35d2 due to Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
2019-08-30 15:48:16,188 ERROR [ckan.lib.jobs] Job 8aa1758c-394e-49c8-a31b-349bc7e867ad on worker rq:worker:data-ckan-stg-1.73789 raised an exception: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
    rv = job.perform()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/job.py", line 498, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/ckanext/versioned_datastore/lib/importing.py", line 90, in import_resource_data
    request.replace, request.api_key)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/ckanext/versioned_datastore/lib/ingestion/ingesting.py", line 62, in ingest_resource
    raise e
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
jrdh commented 5 years ago

afe28105-1633-4d93-b970-5c680a34c241 - still breaks, because who knows what character encoding they are using. However, it now breaks during validation and shows an InvalidCharacterException: the character set it has detected clearly isn't correct, so some of the characters it's finding are rubbish. Here's the datastore page for it: https://data-nlb-stg-1.nhm.ac.uk/dataset/crop-weeds-of-paraguay/resource_data/afe28105-1633-4d93-b970-5c680a34c241.
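
For anyone following along, here is a minimal sketch of why a wrongly guessed character set produces rubbish rather than a clean failure. It is not the code this extension uses: the function name `decode_with_fallback` and the candidate list are hypothetical, and it is written for Python 3 rather than the Python 2.7 shown in the logs above. It tries each encoding in turn and returns the first one that decodes without error.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Decode raw bytes with the first candidate encoding that does not fail.

    Hypothetical helper for illustration only. Returns a tuple of
    (decoded text, name of the encoding that was used).
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes, try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# 0xa1 is not a valid UTF-8 start byte, so strict UTF-8 decoding is rejected
# and the next candidate (cp1252) is used instead.
text, used = decode_with_fallback(b"\xa1Hola")
print(used)
```

Note that a fallback "succeeding" only means the bytes were mappable, not that the guess was right. latin-1 in particular accepts any byte sequence, so a wrong guess yields garbage characters instead of an exception, which matches the behaviour described above.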

6c22efb5-3d3b-40d9-94a6-95eaf8a01c7a - works fine now that the universal-newline issue is resolved. Here's the datastore page for it: https://data-nlb-stg-1.nhm.ac.uk/dataset/grounds-metaanalysis-data/resource_data/6c22efb5-3d3b-40d9-94a6-95eaf8a01c7a.

07555c45-ed3f-4178-83a4-dfa0144e35d2 - also works fine now that the universal-newline issue is resolved. Here's the datastore page for it: https://data-nlb-stg-1.nhm.ac.uk/dataset/crowdsourcing-the-collection/resource_data/07555c45-ed3f-4178-83a4-dfa0144e35d2.
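
As a rough illustration of the universal-newline point (again a sketch, not the actual fix made in this extension): a CSV file that uses bare carriage returns as line endings can be read by letting the text layer translate line endings before the `csv` module sees them. The helper name `read_rows` is made up for this example, and the code assumes Python 3 and UTF-8 input.

```python
import csv
import io


def read_rows(raw):
    """Parse CSV bytes whose lines may end in \\r, \\n or \\r\\n.

    Hypothetical helper for illustration only. newline=None turns on
    universal-newline translation, so every line ending reaches the
    csv reader as a plain "\\n".
    """
    text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None)
    return list(csv.reader(text_stream))


# Old Mac-style file: rows are separated by bare carriage returns.
rows = read_rows(b"name,count\rfoo,1\rbar,2\r")
print(rows)
```

One trade-off worth knowing: translating newlines this way also rewrites any line breaks embedded inside quoted fields, which is why the Python 3 `csv` documentation suggests opening files with `newline=''` when those need to be preserved exactly.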