frictionlessdata / pilot-ukds

Pilot UKDS
MIT License
1 stars 0 forks source link

UKDS Pipeline Pilot

Travis Coveralls

An example Data Package Pipeline to harvest data from UKDS, transform, validate, define visualizations, and import into datahub.io.

As well as the pipeline, this repository maintains a pipeline processor to add OAI-PMH dataset metadata to the datapackage: ukds.add_oai_metadata. View specs can be added to a datapackage with the ukds.add_datapackage_views processor.

The following pipeline plugins are also used by the pipeline:

The basic flow from the UKDS Reshare resource, to the final datahub.io entry is outlined in the diagram below:

Pipeline flow

The source-spec is defined in /entries/ukds.source-spec.yaml:

oai-url:
  http://reshare.ukdataservice.ac.uk/cgi/oai2

entries:
  my-first-entry:  # a remote spss resource
    source:
      - url: http://reshare.ukdataservice.ac.uk/851500/2/my-spss-file.sav
        format: spss
    oai-id: 851500

  my-second-entry:  # a local csv resource with a view
    source:
      - url: ../data/my-csv-file.csv
        format: csv
        tabulator:
          headers: 1
    views:
      - views/my-views-spec.json

  my-multiple-item:  # multiple resources
    source:
      - url: ../data/Employee data.sav
        format: spss
      - url: ../data/invalid.csv
        format: csv
    oai-id: 851501

Where oai-url is the entry point for the OAI service, and entries is a collection of resources which we're interested in harvesting from UKDS, and uploading to datahub.io.

If an entry has an oai-id property, this will be used to harvest dataset metadata from UKDS to populate the datapackage.

The views property is a list of file paths to json files containing view-spec compatible views.