ckan / ckanext-harvest

Remote harvesting extension for CKAN

WAF with subdirectories fails to harvest? #494

Closed: bonnland closed this issue 2 years ago

bonnland commented 2 years ago

Hi,

I'm trying to harvest from a WAF whose subdirectories contain ISO XML files. If I delete the subdirectories, harvesting seems to work, but when subdirectories exist I get the error shown below.

Am I doing anything wrong? I just switched from CKAN 2.8.7 to 2.9.5 (using Python 3.8); WAF harvesting was working under Python 2.7.

Thanks for any help!

(default) [ckan@localhost ~]$ ckan -c /etc/ckan/default/development.ini harvester run-test mini-waf
2022-04-01 22:22:58,972 INFO  [ckan.cli] Using configuration file /etc/ckan/default/development.ini
2022-04-01 22:22:58,972 INFO  [ckan.config.environment] Loading static files from public
2022-04-01 22:22:58,998 INFO  [ckan.config.environment] Loading templates from /usr/lib/ckan/default/src/ckan/ckan/templates
2022-04-01 22:22:59,426 INFO  [ckan.config.environment] Loading templates from /usr/lib/ckan/default/src/ckan/ckan/templates
2022-04-01 22:22:59,439 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2022-04-01 22:22:59,446 DEBUG [ckanext.harvest.model] Harvest tables already exist
2022-04-01 22:22:59,453 DEBUG [ckanext.spatial.plugin] Setting up the spatial model
2022-04-01 22:22:59,481 DEBUG [ckanext.spatial.model.package_extent] Spatial tables defined in memory
2022-04-01 22:22:59,490 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
2022-04-01 22:22:59,568 CRITI [ckan.lib.uploader] Please specify a ckan.storage_path in your config
                         for your uploads
2022-04-01 22:22:59,959 INFO  [ckanext.harvest.logic.action.create] Harvest job create: {'source_id': '307c55f6-f864-433b-9e7b-f39dd53c5d3b'}
2022-04-01 22:22:59,976 INFO  [ckanext.harvest.logic.action.create] Harvest job saved 98883ef4-68f1-4bf9-b455-ba9b1eb813e5
2022-04-01 22:22:59,982 INFO  [ckanext.harvest.logic.action.update] Send job to gather queue: {'id': '98883ef4-68f1-4bf9-b455-ba9b1eb813e5'}
2022-04-01 22:23:00,024 INFO  [ckanext.harvest.logic.action.update] Sent job 98883ef4-68f1-4bf9-b455-ba9b1eb813e5 to the gather queue
2022-04-01 22:23:00,040 DEBUG [ckanext.spatial.harvesters.waf.WAF.gather] WafHarvester gather_stage for job: <HarvestJob id=98883ef4-68f1-4bf9-b455-ba9b1eb813e5 created=2022-04-01 22:22:59.973833 gather_started=2022-04-01 22:23:00.040467 gather_finished=None finished=None source_id=307c55f6-f864-433b-9e7b-f39dd53c5d3b status=Running>
2022-04-01 22:23:00,042 DEBUG [ckanext.spatial.harvesters.base] Using config: {'user': 'harvest', 'read_only': True}
2022-04-01 22:23:00,074 DEBUG [ckanext.spatial.harvesters.waf] WAF new_url: http://localhost:9000/sagedev-dset-harvest-test/ckan_export/
Traceback (most recent call last):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 277, in _extract_waf
    parsed = scrapers[scraper].parseString(content)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/pyparsing/core.py", line 1134, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 101, in gather_stage
    for url, modified_date in _extract_waf(six.text_type(content),source_url,scraper):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 308, in _extract_waf
    _extract_waf(content, new_url, scraper, results, new_depth)
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 279, in _extract_waf
    parsed = scrapers['other'].parseString(content)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/pyparsing/core.py", line 1134, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/ckan", line 33, in <module>
    sys.exit(load_entry_point('ckan', 'console_scripts', 'ckan')())
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/cli.py", line 284, in run_test
    utils.run_test_harvester(id, force_import)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/utils.py", line 422, in run_test_harvester
    lib.run_harvest_job(job_obj, harvester)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/tests/lib.py", line 36, in run_harvest_job
    obj_ids = queue.gather_stage(harvester, job)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 432, in gather_stage
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/waf.py", line 104, in gather_stage
    msg = 'Error extracting URLs from %s, error was %s' % (source_url, e)
  File "/usr/lib/ckan/default/lib64/python3.8/site-packages/pyparsing/exceptions.py", line 149, in __str__
    found_match = _exception_word_extractor.match(self.pstr, self.loc)
TypeError: cannot use a string pattern on a bytes-like object

bonnland commented 2 years ago

I just realized that waf.py is in the ckanext-spatial plugin, so I should post there instead.
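
For anyone reading the traceback above: the final TypeError is the error Python 3 raises when a text (str) regular expression is applied to bytes, and the last frame shows pyparsing doing exactly that while building the exception message. The sketch below reproduces that error using only the standard library. It assumes, without confirmation from this thread, that the subdirectory listing reached the parser as undecoded bytes. The names `raw_listing`, `href_pattern`, `find_links`, and `reproduce_type_error` are hypothetical and are not part of ckanext-spatial or pyparsing.

```python
import re

# Hypothetical stand-in for a directory listing fetched over HTTP.
# In Python 3 a raw HTTP body is bytes, not text.
raw_listing = b'<a href="subdir/">subdir/</a> <a href="record.xml">record.xml</a>'

# A text (str) pattern, the same kind of pattern the traceback shows
# pyparsing applying to the parsed input when formatting its exception.
href_pattern = re.compile(r'href="([^"]+)"')


def reproduce_type_error(listing):
    """Apply the str pattern to the value unchanged; return the error text, if any."""
    try:
        href_pattern.match(listing, 0)
    except TypeError as exc:
        return str(exc)
    return None


def find_links(listing):
    """Return href targets, decoding bytes to text before matching."""
    if isinstance(listing, bytes):
        # Assumption: the listing is UTF-8; undecodable bytes are replaced.
        listing = listing.decode("utf-8", errors="replace")
    return href_pattern.findall(listing)


# Matching against bytes fails the same way the last traceback frame does.
error_message = reproduce_type_error(raw_listing)

# Decoding first lets the same pattern work.
links = find_links(raw_listing)

# Note that str() on bytes does not decode in Python 3: it yields the
# repr, including the leading b' prefix, rather than the page text.
wrapped = str(raw_listing)
```

If this is what is happening, decoding the fetched content before each parse, including the recursive calls for subdirectories, would be the thing to check in waf.py.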