ckan / ideas

[DEPRECATED] Use the main CKAN repo Discussions instead:
https://github.com/ckan/ckan/discussions

Data Transformations #96

Open amercader opened 9 years ago

amercader commented 9 years ago

Originally submitted by @kindly on ckan/ckan#1394

Purpose

To give CKAN basic data transformation abilities. These transformations will operate on data within the CKAN datastore.

This will initially include only the basics: importing data into the datastore, changing the type of a column and renaming a column.
In future, however, it could grow to cover many other transformations, such as data cleaning, aggregation or normalization.
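
As a concrete illustration only (the step names and keys here are invented, not part of the proposal), a configured set of transformations for those basics might look like:

```python
# Hypothetical transformation pipeline -- the step names and keys are
# illustrative only; the proposal does not fix a format.
transformations = [
    {'type': 'load', 'format': 'csv'},                           # import into the datastore
    {'type': 'change_type', 'column': 'dt', 'to': 'timestamp'},  # retype a column
    {'type': 'rename_column', 'from': 'dt', 'to': 'date'},       # rename a column
]
```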

Implementation

The model will have one extra table named "transformation" with the following fields.
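
The field list itself isn't reproduced above; purely as a sketch, with every column name an assumption rather than taken from the proposal, the table might look like this in SQLAlchemy (which CKAN already uses):

```python
# Hypothetical "transformation" table -- every column name here is an
# assumption; the proposal's field list is not shown above.
from sqlalchemy import Column, DateTime, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Transformation(Base):
    __tablename__ = 'transformation'
    id = Column(Integer, primary_key=True)
    resource_id = Column(Text, nullable=False)  # resource being transformed
    position = Column(Integer, nullable=False)  # order of the stage in the pipeline
    type = Column(Text, nullable=False)         # e.g. 'load', 'change_type', 'rename_column'
    config = Column(Text)                       # JSON blob of stage-specific options
    state = Column(Text)                        # e.g. 'pending', 'complete', 'error'
    log = Column(Text)                          # output reported back by the service
    last_updated = Column(DateTime)
```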

Each transformation type will have its own set of rules and will call an external service to do the transformation. Initially the only external service supported will be the datapusher. The datapusher extension will also manage moving the process on to the next stage.

The first stage will normally consist of the datapusher loading the CSV/Excel data into the datastore.

Each stage will call back to CKAN to report how it went and, if it was successful, request the next stage if one is available.
If one of the stages fails, the process will stop and the logs stored against that stage will give information on the failure. There will be an option to rerun all the stages or to retry just the failed one. There will also be an option to clear all the transformations and start configuring the whole process again from scratch.
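
A minimal sketch of that callback step, assuming hypothetical payload keys, stage store and helper names (none of which are fixed by the proposal):

```python
# Hypothetical CKAN-side callback -- the payload keys, the stages store
# and request_stage() are all assumptions, not part of the proposal.
stages = {}  # (resource_id, position) -> Transformation row, assumed loaded

def transformation_callback(payload):
    """Record one stage's result and, on success, kick off the next stage."""
    stage = stages[(payload['resource_id'], payload['stage'])]
    stage.state = 'complete' if payload['success'] else 'error'
    stage.log = payload['log']  # kept so the UI can show per-stage logs

    if not payload['success']:
        return  # stop here; the UI offers retry / rerun-all / clear-all

    next_stage = stages.get((payload['resource_id'], payload['stage'] + 1))
    if next_stage is not None:
        request_stage(next_stage)

def request_stage(stage):
    """Placeholder: submit the stage as a job to the external service."""
    ...
```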

Process Diagram

      +------------+           +---------------------+         +-----------------------+
      |   CKAN     |           |   CKAN Datapusher   |         |   Datapusher Service  |
      +------------+           +---------------------+         +-----------------------+
      | UI to Start+---------->|     Request         +-------->|                       |
      | Process    |           |     Stage 1         |         |     Run               |
      +------------+           +---------------------+         |     Stage 1           |
      |            | if fail   |  Store Stage 1      |         |                       |
      | Show       |<----------+     Result          |<--------+                       |
      | Stage 1    |           |                     |         +-----------------------+
      | Log        |           |     Request         |         |                       |
      |            |           |     Stage 2         +-------->|     Run               |
      +------------+           +---------------------+         |     Stage 2           |
      | Show       | if fail   |  Store Stage 2      |         |                       |
      | Stage 2    |<----------+     Result          |<--------+                       |
      | Log        |           |                     |         +-----------------------+
      +------------+           |                     |
      | Show       |<----------+   Report Success    |
      | Success    |           +---------------------+
      +------------+

Interface

The datastore tab on the resource edit page will have a list of the stages and an option to configure the last stage added. Each stage that has been run will have an expandable view of the datapusher log of what happened in that stage; this will be expanded automatically for failure cases.

There will be a button to reload the data with the same transformations; this is for when you want to reprocess changed underlying resource data. There will also be a button to clear all the transformations and start adding transformations from scratch.

rufuspollock commented 9 years ago

I'd strongly argue for this being a separate service. In that case, the real question is what, if anything, is needed in CKAN to support good integration with the third-party service.

wardi commented 9 years ago

except many transformations are much faster if done with a direct connection to the DB.

If the data isn't large it would be nicer to perform a transformation with an external service that pushes the changes into a new resource. Or maybe look at tabular data store options that allow revisions.
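
To make the direct-connection point concrete (a sketch only; the DSN and resource id are placeholders, and datastore tables are named after their resource's id): a column rename is a single in-place statement instead of a full export/re-import through an external service.

```python
# Sketch: with a direct connection, a rename is one in-place statement.
# The DSN and resource id are placeholders.
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect('dbname=datastore_default')  # placeholder DSN
resource_id = 'aaaa-bbbb-cccc'                       # placeholder resource id
with conn, conn.cursor() as cur:
    # datastore tables are named after the resource id
    cur.execute(sql.SQL('ALTER TABLE {} RENAME COLUMN {} TO {}').format(
        sql.Identifier(resource_id),
        sql.Identifier('dt'),
        sql.Identifier('date')))
```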

rossjones commented 9 years ago

> except many transformations are much faster if done with a direct connection to the DB

I agree, and I think it'd be nice if we could improve the speed at which data comes out of the database when you don't have a direct connection. For instance, it'd be nice to be able to stream the data out of the datastore, even with something like http://dataprotocols.org/sleep/ - this'd probably make it more Dat-able too.
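
For reference, the closest thing available today is paging with the existing `datastore_search` action; a sketch, with the site URL and resource id as placeholders:

```python
# Sketch: paging records out of the datastore over the action API.
# datastore_search is an existing CKAN action; everything else here
# (site URL, resource id, page size) is a placeholder.
import requests

def stream_records(site, resource_id, page_size=1000):
    offset = 0
    while True:
        r = requests.get(site + '/api/3/action/datastore_search',
                         params={'resource_id': resource_id,
                                 'limit': page_size, 'offset': offset})
        records = r.json()['result']['records']
        if not records:
            return
        for record in records:
            yield record
        offset += page_size
```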

jqnatividad commented 8 years ago

In NYC, the Mayor's Office of Data Analytics uses DataBridge, which, if I'm not mistaken, was built by Accenture as a data integration/analytics platform.

Having a CKAN-aware, enterprise-quality ETL platform would go a long way towards ensuring that data published through CKAN is up-to-date.

That's why we built CKAN writers for Safe FME and Pentaho Kettle and updated the OpenRefine CKAN connector.

What about shortlisting ETL tools like these as part of CKAN deployment best practices? And, for open-source tools like Kettle, what about building a more robust integration instead of building a lightweight ETL tool ourselves?

rossjones commented 8 years ago

I think having decent ETL for CKAN is critical, having decent ETL inside CKAN, less so - especially if it's only the T.

However....

Taking an approach where existing ETL tools have a way of interfacing with CKAN is a good one for those who have specific use-cases, and we should do more of that! But I also think we need a semi-integrated ETL solution, hosted on github.com/ckan, that provides building blocks for re-use.

Re-reading kindly’s original proposal, I take it as CKAN providing the hooks for ETL but leaving the detail of how/what it actually is to the end user (via external transformation services). As the title suggests, it doesn’t really cover the E or L parts of ETL.

So far, my experience with ETL and CKAN has been that there are (at least) three different levels of ETL each with their own pitfalls and dark places.

Automagic ETL

This is the datapusher approach, which works in some cases but is currently limited and prone to non-error failures (missing header rows etc). This is mentioned in the original proposal, but I think the datapusher only covers the _L_oad and, as with all automagic solutions, has its limitations.

Mostly manual ETL

This approach is the most accurate, but also the most resource-consuming. It works for a limited number of datasets, but it really doesn’t scale particularly well. There are a handful of existing scrape->ckan implementations out there, but they’re of limited re-usability because they are so specific to the data they process.

Semi-magical ETL

Semi-magical ETL is, I think, probably the thing to aim at. I imagine this as a couple of things: providing tools to let people do their own ETL, but also providing CKAN-specific functionality to allow for auto-magic where there is enough data to do it safely (i.e. when there is a JTS schema for a dataset).
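
As a sketch of that safe auto-magic case, deriving datastore field definitions from a JTS schema might look like this (the type mapping below is an assumption, not a documented one):

```python
# Sketch: turn a JTS (JSON Table Schema) field list into datastore
# field definitions.  The JTS-to-Postgres type mapping is an assumption.
JTS_TO_DATASTORE = {
    'string': 'text', 'integer': 'int', 'number': 'numeric',
    'boolean': 'bool', 'date': 'date', 'datetime': 'timestamp',
}

def datastore_fields(jts_schema):
    """Build the `fields` argument for datastore_create from a JTS schema."""
    return [{'id': field['name'],
             'type': JTS_TO_DATASTORE.get(field['type'], 'text')}
            for field in jts_schema['fields']]
```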

I think this consists of

Anyway …

We now have 6 ideas tickets that reference ETL - perhaps we would get further by having a CKAN-ETL Working Group where we can collect use-cases, implementation ideas and try things out? @jqnatividad - what do you think? Up for organising something like this?

_edit_ Just to clarify, I think if we can identify something that does MOST of what we want (perhaps it's Kettle), we should definitely use that and integrate more tightly rather than re-inventing the wheel.