ckan / ckanext-dcat

CKAN ♥ DCAT
https://docs.ckan.org/projects/ckanext-dcat

Harvester Crashes when JSON Harvester Crashes... #164

Open gallexme opened 5 years ago

gallexme commented 5 years ago

2019-10-10 12:28:13,045 DEBUG [ckanext.dcat.harvesters._json] In DCATJSONHarvester import_stage
Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 10, in <module>
    sys.exit(run())
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 192, in command
    fetch_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 418, in fetch_callback
    fetch_and_import_stages(harvester, obj)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 436, in fetch_and_import_stages
    success_import = harvester.import_stage(obj)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters/_json.py", line 225, in import_stage
    self._get_package_name(harvest_object, package_dict['title'])
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters/base.py", line 134, in _get_package_name
    name = self._gen_new_name(title)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/harvesters/base.py", line 90, in _gen_new_name
    ideal_name = munge_title_to_name(title)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/munge.py", line 47, in munge_title_to_name
    name = re.sub('[ .:/]', '-', name)
  File "/usr/lib/ckan/default/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

This error should instead be handled, and the harvester should continue with the next harvest source, don't you think?
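To illustrate the behaviour being asked for, here is a minimal sketch of a loop that records a failure per object and keeps going. It is not the ckanext-harvest code: `to_name` is a simplified stand-in for the munge step in the traceback, and `import_all` is a hypothetical name used only for this example.

```python
import logging
import re

log = logging.getLogger(__name__)


def to_name(title):
    # Simplified stand-in for the munge step; like the real call in the
    # traceback, re.sub raises TypeError when title is None.
    return re.sub('[ .:/]', '-', title).strip('-').lower()


def import_all(package_dicts):
    """Import every dict, recording failures instead of aborting the run."""
    imported, errors = [], []
    for index, package_dict in enumerate(package_dicts):
        try:
            imported.append(to_name(package_dict.get('title')))
        except Exception as exc:
            # Log and remember the failure, then move on to the next object.
            log.exception('Import failed for object %d', index)
            errors.append((index, repr(exc)))
    return imported, errors
```

With a guard like this, a dataset without a usable title ends up in `errors` while the remaining objects are still processed.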

gallexme commented 4 years ago

The issue still exists.

gallexme commented 4 years ago

Example JSON:

{u'dcat:contactPoint': [],
 u'dcat:keyword': [u'Bev\xf6lkerung'],
 u'dct:issued': u'2019-02-25T15:01:27+01:00',
 u'dct:title': u'Nat\xfcrliche und r\xe4umliche Bewegungen ',
 u'dct:modified': u'2019-02-25T15:01:27+01:00',
 u'dcat:Distribution': [
     {u'dcat:byteSize': u'2099',
      u'dct:issued': u'2019-02-25T15:03:36+01:00',
      u'dct:title': u'Bewegungen 2018 [CSV]',
      u'foaf:page': u'https://opendata-duisburg.de/dataset/nat%C3%BCrliche-und-r%C3%A4umliche-bewegungen/resource/e5a0233a-0d9c-4fca-a58f-35a9bcfc0022',
      u'dct:modified': u'2019-04-17T12:08:53+02:00',
      u'dcat:accessURL': u'https://opendata-duisburg.de/dataset/nat%C3%BCrliche-und-r%C3%A4umliche-bewegungen/resource/e5a0233a-0d9c-4fca-a58f-35a9bcfc0022',
      u'dct:description': u'<p><strong>Stand:</strong> 31.12.2018</p>\n',
      u'dcat:mediaType': u'text/csv',
      u'dcat:downloadURL': u'https://opendata-duisburg.de/sites/default/files/BEWo2018_1.csv',
      u'dct:format': u'csv'},
     {u'dcat:byteSize': u'',
      u'dct:issued': u'2019-02-26T11:41:25+01:00',
      u'dct:title': u'Bewegungen 2018 [JSON]',
      u'foaf:page': u'https://opendata-duisburg.de/dataset/nat%C3%BCrliche-und-r%C3%A4umliche-bewegungen/resource/dba0648a-6a6c-45d7-bf13-0453f922202b',
      u'dct:modified': u'2019-02-26T11:41:36+01:00',
      u'dcat:accessURL': u'https://opendata-duisburg.de/dataset/nat%C3%BCrliche-und-r%C3%A4umliche-bewegungen/resource/dba0648a-6a6c-45d7-bf13-0453f922202b',
      u'dct:description': u'<p><strong>Stand:</strong> 31.12.2018</p>\n',
      u'dcat:mediaType': u'',
      u'dcat:downloadURL': u'',
      u'dct:format': u'json'}],
 u'dct:description': u'<p>Geburten, Sterbef\xe4lle, Fortz\xfcge, Zuz\xfcge und Umz\xfcge</p>\n<p><strong>Gebietsgliederung:</strong> Ortsteilsebene</p>\n<p><strong>Quelle:</strong> Einwohnermeldedatei; Auswertung Stabstelle f\xfcr Wahlen und Informationslogistik</p>\n',
 u'dct:identifier': u'444a16f2-cdd5-4030-9517-e89f0eeb9175',
 u'@rdf:about': u'https://opendata-duisburg.de/dataset/nat%C3%BCrliche-und-r%C3%A4umliche-bewegungen',
 u'dct:spatial': u'To Big To Post here Atleast 100kb',
 'dct:publisher': u'Stadt Duisburg'}

gallexme commented 4 years ago

I know the JSON is wrong for the harvester, but it shouldn't prevent other harvest jobs from being processed at all, leaving them stuck in "running" for months.
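One way to make such input fail cleanly is to check the title before a name is generated from it. The sketch below assumes the harvester reads an unprefixed `title` key, which the `package_dict['title']` lookup in the traceback suggests but does not confirm; `require_title` is a hypothetical helper and not part of ckanext-dcat.

```python
def require_title(dcat_dict):
    """Return a usable title string or raise a descriptive ValueError."""
    title = dcat_dict.get('title')
    if isinstance(title, str) and title.strip():
        return title.strip()
    # List look-alike keys such as 'dct:title' so the cause is easy to spot.
    lookalikes = sorted(k for k in dcat_dict if k.endswith(':title'))
    raise ValueError(
        'dataset has no usable "title" (similar keys: %s)'
        % (', '.join(lookalikes) or 'none'))
```

Raising a specific error at this point would let the caller report the dataset as failed with a readable message instead of hitting a `TypeError` deep inside the name generation.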