datopian / ckanext-sweden

CKAN extension for Öppnadata.se, the Swedish data management platform
GNU Affero General Public License v3.0

Show DCAT validation output #10

Open amercader opened 9 years ago

amercader commented 9 years ago

Context

On the Swedish open data portal, datasets are harvested from DCAT metadata dumps like this one. Each dump is parsed by the ckanext-dcat harvester, which creates the corresponding CKAN datasets.

There is a CKAN organization and a CKAN harvest source for each remote organization that has its datasets imported into CKAN.

The DCAT files are validated using an external validation service:

https://validator.dcat-editor.com/

This service only supports POST requests. For example, when called with the DCAT file linked above, it returns this output.

curl -X POST -d@catalog.rdf https://validator.dcat-editor.com/service
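
For reference, a minimal Python sketch of the same call, using only the standard library. The function names are made up for illustration, and the response is returned as plain text because its format is not described here. One difference from the curl command: `curl -d @file` strips newlines from the file, whereas this sends the bytes unchanged.

```python
import urllib.request

VALIDATOR_URL = "https://validator.dcat-editor.com/service"


def build_validation_request(rdf_bytes: bytes) -> urllib.request.Request:
    """Build a POST request carrying the raw DCAT file as the body."""
    # No Content-Type is set here; urllib then falls back to
    # application/x-www-form-urlencoded, which is also what `curl -d` sends.
    return urllib.request.Request(VALIDATOR_URL, data=rdf_bytes, method="POST")


def validate_file(path: str, timeout: float = 30.0) -> str:
    """Send the file at `path` to the validator and return the response text."""
    with open(path, "rb") as f:
        request = build_validation_request(f.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `validate_file("catalog.rdf")` would perform the actual request; `build_validation_request` is kept separate so the request can be inspected without hitting the service.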

We are hooking up with the validation service at this point:

https://github.com/okfn/ckanext-sweden/blob/master/ckanext/sweden/dcat/plugin.py#L21

This is called after the remote file is downloaded and before the contents are parsed and datasets created. Note that we are returning an array with validation errors. These are stored as harvest errors, more specifically GatherErrors, linked to a Harvest Job (which is linked to a Harvest Source, linked to an Organization). For instance, these errors are displayed in the harvest report page.

What's needed

On the custom dcat_organization_list action, each organization needs a dcat_validation key with the value http://{host}/organization/{id}/dcat_validation
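
As a rough illustration of what adding that key could look like, here is a small sketch. The helper name and the assumption that each organization is a dict with an `id` field are mine, not taken from the extension's code.

```python
def with_dcat_validation_url(org: dict, host: str) -> dict:
    """Return a copy of `org` with the dcat_validation URL added."""
    enriched = dict(org)  # shallow copy so the caller's dict is left untouched
    enriched["dcat_validation"] = (
        "http://{host}/organization/{id}/dcat_validation".format(
            host=host, id=org["id"]
        )
    )
    return enriched
```

For example, `with_dcat_validation_url({"id": "test_organization_1"}, "example.org")` adds the URL `http://example.org/organization/test_organization_1/dcat_validation`.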

This endpoint should point to a custom action that returns the validation errors for the last harvest done for this organization (more precisely, the errors that occurred during the last harvest job of the organization's harvest source).
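
To make the lookup concrete, here is a sketch of the chain from organization to harvest source to last job to gather errors. It uses plain in-memory dataclasses as stand-ins; the class and field names are hypothetical and are not the real ckanext-harvest models, where this would be a database query.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class HarvestJob:
    # Hypothetical stand-in for a harvest job and its gather errors.
    created: datetime
    gather_errors: List[str] = field(default_factory=list)


@dataclass
class HarvestSource:
    # Hypothetical stand-in for a harvest source owned by an organization.
    owner_org: str
    jobs: List[HarvestJob] = field(default_factory=list)


def last_job_errors(sources: List[HarvestSource], org_id: str) -> Optional[List[str]]:
    """Follow org id > harvest source > last job > gather errors.

    Returns None when the organization has no harvest source or no job yet,
    and an empty list when the last job produced no gather errors.
    """
    source = next((s for s in sources if s.owner_org == org_id), None)
    if source is None or not source.jobs:
        return None
    last_job = max(source.jobs, key=lambda job: job.created)
    return list(last_job.gather_errors)
```

Distinguishing "no job yet" (`None`) from "last job was clean" (empty list) is a design choice of this sketch; the action could equally report both as an empty result.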

The actual output can be:

  1. Something generated from the harvest errors we are already storing. Cons: we need some cumbersome queries to get the relevant harvest gather errors with just the org id (we need to link org id > harvest source > last job > gather errors). Pros: @joetsoi already did some work to store the validation errors as JSON, so that might make things easier to build the whole output.
  2. Store the whole output coming from the validation service in the database at this point and just dump whatever was returned. Pros: we don't need to worry about parsing it. Cons: we need to create a custom db table, linked to the owner org, and make sure it doesn't grow too much (e.g. by keeping only the most recent report per organization).
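
For option 2, one way to keep the table from growing is to make the organization id the primary key, so each new report overwrites the previous one. The sketch below uses `sqlite3` only so that it is self-contained; the table and function names are hypothetical, and a real implementation would live in CKAN's own database.

```python
import sqlite3
from typing import Optional


def create_table(conn: sqlite3.Connection) -> None:
    """Create a table holding at most one validation report per organization."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dcat_validation_report ("
        " org_id TEXT PRIMARY KEY,"
        " created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " report TEXT NOT NULL)"
    )


def save_report(conn: sqlite3.Connection, org_id: str, report: str) -> None:
    """Store the raw validator output, replacing any earlier report for the org."""
    # org_id is the primary key, so INSERT OR REPLACE overwrites the old row.
    conn.execute(
        "INSERT OR REPLACE INTO dcat_validation_report (org_id, report)"
        " VALUES (?, ?)",
        (org_id, report),
    )
    conn.commit()


def get_report(conn: sqlite3.Connection, org_id: str) -> Optional[str]:
    """Return the most recent stored report for the org, or None if there is none."""
    row = conn.execute(
        "SELECT report FROM dcat_validation_report WHERE org_id = ?",
        (org_id,),
    ).fetchone()
    return row[0] if row else None
```

The report is stored as opaque text and returned unchanged, which matches the "just dump whatever was returned" idea.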

amercader commented 9 years ago

@brew, @tryggvib as per our chat before ^

amercader commented 9 years ago

@brew Your validation branch looks really good.

http://sweden.staging.ckanhosted.com/api/action/dcat_validation?id=test_organization_1

Some stuff to finish it off: