cioos-siooc / ckan

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers datahub.io, catalog.data.gov and europeandataportal.eu/data/en/dataset among many other sites.
http://ckan.org/

Harvesters not terminating #227

Closed sjbruce closed 1 month ago

sjbruce commented 4 months ago

CKAN version 1.6.0

Describe the bug: Harvest jobs on fresh installs of CKAN 1.6.0 do not appear to terminate by themselves as they did in previous versions.

The current job has been running for well over an hour, although it has inserted all datasets correctly.

However, the process appears to fail before the indexes are updated: the home page shows a dataset count of zero, and no E*Vs are listed as having any datasets attached to them.

The datasets page does show the datasets, E*Vs, responsible organizations, tags, resource types, licenses, and formats.

The map shows dataset extents, and the filters appear to be working properly.

Log outputs for the ckan and harvester containers are attached.

Steps to reproduce Steps to reproduce the behavior:

Expected behavior: The harvester should have run and produced a set of results detailing how many datasets were added, updated, deleted, etc.

Additional details: [image]

Configuration:

{
  "default_tags": [],
  "default_extras": {
    "encoding": "utf8",
    "h_source_id": "{harvest_source_id}",
    "h_source_url": "{harvest_source_url}",
    "h_source_title": "{harvest_source_title}",
    "h_job_id": "{harvest_job_id}",
    "h_object_id": "{harvest_object_id}"
  },
  "override_extras": false,
  "clean_tags": true,
  "validator_profiles": ["iso19115"],
  "harvest_iso_categories": false
}

CKAN Container & Harvester Logs:

ckan.log ckan_harvesters.log

sjbruce commented 4 months ago

I should note that the harvester configuration above is lifted directly from a 1.5.0 deployment of CKAN.

fostermh commented 4 months ago

Is the ckan_run_harvester container running? Are the cron jobs in this container executing?

You can run the harvester cleanup manually by executing ckan --config=/srv/app/ckan.ini harvester run or by clicking 'stop' in the GUI.

See /contrib/docker/crontab for a list of cron jobs that are run in the ckan_run_harvester container.

It could be related to container permissions. The ckan_run_harvester container must be run as root.
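For anyone repeating these checks from the Docker host, here is a minimal sketch. It assumes the container really is named ckan_run_harvester as in this thread; `harvester_exec` and the `DRY_RUN` switch are hypothetical names made up for illustration, not part of the repo.

```shell
#!/bin/sh
# Sketch only: wrap `docker exec` so the checks above are easy to repeat.
# HARVESTER_CONTAINER defaults to the container name used in this thread.
# DRY_RUN=1 (hypothetical switch) prints the command instead of running it.
HARVESTER_CONTAINER="${HARVESTER_CONTAINER:-ckan_run_harvester}"

harvester_exec() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker exec $HARVESTER_CONTAINER $*"
    else
        docker exec "$HARVESTER_CONTAINER" "$@"
    fi
}

# Usage (run by hand from the Docker host):
#   harvester_exec crontab -l    # list cron jobs installed for the container user
#   harvester_exec ckan --config=/srv/app/ckan.ini harvester run
```

If `crontab -l` comes back empty, the cron jobs were never installed, which would explain jobs that never get cleaned up.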

sjbruce commented 4 months ago

ckan_run_harvester is running but there don't appear to be any cron jobs running or indeed scheduled.

The Dockerfile does have a line to copy the crontab file to the container, and the file is in /srv/app/src/ckan/contrib/docker, but /etc/crontabs/root simply lists the instructions to run cron jobs in the /etc/periodic/ sub-directories, all of which are empty.

It doesn't look like the cron jobs are installed.

Running the command above, it complains about "SECRET_KEY". That likely means part or all of this comes down to not running the ckan generate config command and grabbing the appropriate key values, or not executing the commented-out commands at the top of the .env file.

I note that those commands will fail on Windows/WSL due to some low-level nonsense on that side. I'll work around it and rebuild the containers to see if that makes a difference.

I imagine it'll let the command above run, but I don't suspect it'll change anything with the cron jobs themselves.

fostermh commented 4 months ago

There are a couple of issues here.

Line 20 in ckan-run-harvester-entrypoint.sh should be: cat /srv/app/src/ckan/contrib/docker/crontab | crontab -
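As a rough illustration of that fix, the sketch below installs the crontab file and fails loudly if the file is missing, so a bad path does not silently leave the container without cron jobs. `install_crontab` is a hypothetical helper name, not something in the entrypoint script today.

```shell
#!/bin/sh
# Sketch only: install a crontab file for the current user.
# `install_crontab` is a hypothetical helper, not part of the repo.
install_crontab() {
    crontab_file="$1"
    if [ ! -r "$crontab_file" ]; then
        echo "crontab file not readable: $crontab_file" >&2
        return 1
    fi
    # Same effect as the suggested line: feed the file to `crontab -` on stdin.
    crontab - < "$crontab_file"
}

# Usage inside the container:
#   install_crontab /srv/app/src/ckan/contrib/docker/crontab && crontab -l
```

Running `crontab -l` afterwards should list the same entries as the file.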

While CKAN can read its config from environment variables, the command line tools do not. So in order for all the cron job tasks to work, we need to update the ckan.ini.

Uncomment the following lines in your ckan.ini in the container:

ckan.plugins = envvars
              stats
              text_view
              image_view
              recline_view
              datastore
              datapusher
              scheming_datasets
              scheming_organizations
              scheming_groups
              scheming_nerf_index
              fluent
              harvest
              ckan_harvester
              csw_harvester
              waf_harvester
              doc_harvester
              ckan_schema_harvester
              spatial_metadata
              spatial_query
              spatial_harvest_metadata_api
              cioos_harvest
              cioos_theme
              ckan_cioos_harvester
              dcat
              structured_data
              resource_proxy
              geo_view
              geojson_view
              wmts_view
              ckan_spatial_harvester
              datastream_harvester
              #geonetwork_harvester

#   module-path:file to schemas being used
scheming.dataset_schemas = ckanext.scheming:cioos_siooc_schema.json
scheming.presets = ckanext.scheming:presets.json
                   ckanext.fluent:presets.json
scheming.dataset_fallback = true
scheming.organization_schemas = ckanext.scheming:organization.json
scheming.group_schemas = ckanext.scheming:group.json

It is odd that the fetch and gather containers work while the run container does not... This config settings issue would also account for the odd indexing problems.
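A quick way to confirm the lines are actually uncommented is sketched below. It assumes ckan.plugins starts at column 0 with indented continuation lines, as in the snippet above; `plugin_enabled` is a hypothetical helper written for this thread.

```shell
#!/bin/sh
# Sketch only: succeed if the INI file has an uncommented ckan.plugins
# setting that lists the given plugin. `plugin_enabled` is hypothetical.
plugin_enabled() {
    ini_file="$1"
    plugin="$2"
    # Print the ckan.plugins value and its indented continuation lines,
    # skipping '#' comment lines, then match the plugin as a whole word.
    awk '
        /^[ \t]*#/ { next }                                  # commented-out line
        /^ckan\.plugins[ \t]*=/ { in_val = 1; sub(/^[^=]*=/, ""); print; next }
        in_val && /^[ \t]+[^ \t]/ { print; next }            # continuation line
        { in_val = 0 }                                       # anything else ends the value
    ' "$ini_file" | tr -s '[:space:]' '\n' | grep -Fqx -- "$plugin"
}

# Usage inside the container:
#   plugin_enabled /srv/app/ckan.ini harvest && echo "harvest plugin enabled"
```

If this fails for harvest or envvars, the command line tools are most likely still reading the commented-out defaults.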

fostermh commented 4 months ago

Note that there appears to be some odd behaviour when updating the frequency of a harvest job. While the change will show up in the GUI after hitting save, the time of the next harvest job run is not adjusted in the database until the next time it runs. This means that when going from weekly to always frequency, for example, the job will not be updated until the next time it runs, potentially in a week. To update sooner, you will need to manually run the harvest to ensure the database is updated to the new settings.