I'm trying to harvest from a WAF whose subdirectories contain ISO XML files. If I delete the subdirectories, harvesting seems to work, but when subdirectories exist, I get the error below.
Am I doing anything wrong? I just switched from CKAN 2.8.7 to 2.9.5 (using Python 3.8); WAF harvesting was working under Python 2.7.
Thanks for any help!
(default) [ckan@localhost ~]$ ckan -c /etc/ckan/default/development.ini harvester run-test mini-waf
2022-04-01 22:22:58,972 INFO [ckan.cli] Using configuration file /etc/ckan/default/development.ini
2022-04-01 22:22:58,972 INFO [ckan.config.environment] Loading static files from public
2022-04-01 22:22:58,998 INFO [ckan.config.environment] Loading templates from /usr/lib/ckan/default/src/ckan/ckan/templates
2022-04-01 22:22:59,426 INFO [ckan.config.environment] Loading templates from /usr/lib/ckan/default/src/ckan/ckan/templates
2022-04-01 22:22:59,439 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2022-04-01 22:22:59,446 DEBUG [ckanext.harvest.model] Harvest tables already exist
2022-04-01 22:22:59,453 DEBUG [ckanext.spatial.plugin] Setting up the spatial model
2022-04-01 22:22:59,481 DEBUG [ckanext.spatial.model.package_extent] Spatial tables defined in memory
2022-04-01 22:22:59,490 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
2022-04-01 22:22:59,568 CRITI [ckan.lib.uploader] Please specify a ckan.storage_path in your config
for your uploads
2022-04-01 22:22:59,959 INFO [ckanext.harvest.logic.action.create] Harvest job create: {'source_id': '307c55f6-f864-433b-9e7b-f39dd53c5d3b'}
2022-04-01 22:22:59,976 INFO [ckanext.harvest.logic.action.create] Harvest job saved 98883ef4-68f1-4bf9-b455-ba9b1eb813e5
2022-04-01 22:22:59,982 INFO [ckanext.harvest.logic.action.update] Send job to gather queue: {'id': '98883ef4-68f1-4bf9-b455-ba9b1eb813e5'}
2022-04-01 22:23:00,024 INFO [ckanext.harvest.logic.action.update] Sent job 98883ef4-68f1-4bf9-b455-ba9b1eb813e5 to the gather queue
2022-04-01 22:23:00,040 DEBUG [ckanext.spatial.harvesters.waf.WAF.gather] WafHarvester gather_stage for job: <HarvestJob id=98883ef4-68f1-4bf9-b455-ba9b1eb813e5 created=2022-04-01 22:22:59.973833 gather_started=2022-04-01 22:23:00.040467 gather_finished=None finished=None source_id=307c55f6-f864-433b-9e7b-f39dd53c5d3b status=Running>
2022-04-01 22:23:00,042 DEBUG [ckanext.spatial.harvesters.base] Using config: {'user': 'harvest', 'read_only': True}
2022-04-01 22:23:00,074 DEBUG [ckanext.spatial.harvesters.waf] WAF new_url: http://localhost:9000/sagedev-dset-harvest-test/ckan_export/
Traceback (most recent call last):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 277, in _extract_waf
    parsed = scrapers[scraper].parseString(content)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/pyparsing/core.py", line 1134, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 101, in gather_stage
    for url, modified_date in _extract_waf(six.text_type(content),source_url,scraper):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 308, in _extract_waf
    _extract_waf(content, new_url, scraper, results, new_depth)
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 279, in _extract_waf
    parsed = scrapers['other'].parseString(content)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/pyparsing/core.py", line 1134, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/ckan", line 33, in <module>
    sys.exit(load_entry_point('ckan', 'console_scripts', 'ckan')())
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/cli.py", line 284, in run_test
    utils.run_test_harvester(id, force_import)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/utils.py", line 422, in run_test_harvester
    lib.run_harvest_job(job_obj, harvester)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/tests/lib.py", line 36, in run_harvest_job
    obj_ids = queue.gather_stage(harvester, job)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 432, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 104, in gather_stage
    msg = 'Error extracting URLs from %s, error was %s' % (source_url, e)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/pyparsing/exceptions.py", line 149, in __str__
    found_match = _exception_word_extractor.match(self.pstr, self.loc)
TypeError: cannot use a string pattern on a bytes-like object
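
For what it's worth, the final TypeError looks like a bytes-versus-str problem rather than a parsing problem as such. The top-level call wraps the content in six.text_type, while the recursive call for the subdirectory passes content as-is, so my guess (not confirmed against the ckanext-spatial source) is that the subdirectory listing reaches pyparsing as raw bytes. Below is a minimal sketch of that failure mode using only the standard library. WORD is a stand-in for pyparsing's _exception_word_extractor, ensure_text is a hypothetical helper that is not part of ckanext-spatial, and the listing body is made up.

```python
import re

# Stand-in for the str regex that pyparsing applies to the parsed input
# when it formats a ParseException message.
WORD = re.compile(r"\w+")


def ensure_text(content, encoding="utf-8"):
    """Hypothetical helper: decode bytes to str, pass str through unchanged."""
    if isinstance(content, bytes):
        return content.decode(encoding, errors="replace")
    return content


def first_word(content):
    """Return the first word at the start of content, or None."""
    match = WORD.match(content)
    return match.group(0) if match else None


# Made-up directory listing body, as raw bytes from an HTTP response.
raw = b"Index of /ckan_export/"

try:
    first_word(raw)  # a str pattern applied to bytes raises TypeError on Python 3
    error = None
except TypeError as exc:
    error = str(exc)

# Decoding first avoids the TypeError.
decoded_word = first_word(ensure_text(raw))

print(error)
print(decoded_word)
```

If this guess is right, decoding the subdirectory response before it is passed back into _extract_waf might avoid the crash, but I have not tested that.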