
Multi-pass datapusher #150

Closed · jqnatividad closed this issue 6 years ago

jqnatividad commented 9 years ago

As discussed with @wardi and @davidread at IODC/CKANcon:

Pass 1:

Pass 2:

Pass 3 (optional):

Pass 4, etc.

BENEFITS:

rufuspollock commented 9 years ago

@jqnatividad first I note this is really about creating a proper ETL platform for CKAN - see e.g. #18. Some thoughts follow:

Connect / Reuse Frictionless Data and Data Package

Connect this with the Frictionless Data / Data Package work. That already has:

What Workflow We Want

Overall the workflow is something like this (see also the "perfect workflow" section in this OpenSpending user story):

Aside: I note we have a working spike of this approach (command line only) related to OpenSpending: https://github.com/openspending/oscli-poc

Aside: I also assume folks saw things like http://okfnlabs.org/blog/2014/09/11/data-api-for-data-packages-with-dpm-and-ckan.html - this is an automated load of data into the CKAN DataStore, with type guessing ...

Focus Right Now

Right now my suggestion would be to:

Descriptive Statistics

Definitely useful, but probably a separate activity. Again I would connect this with the Tabular Data Package stuff - see e.g. http://data.okfn.org/roadmap#tabular-stats-tool

General Enrichment Services

Again, fantastic - let's just create a model for these. We probably want an integrated UX but separate services, i.e. each of these runs "standalone" from an implementation perspective but the UX is integrated. This has been discussed for a number of services already, e.g. the link checker (dead or alive), and I think it is key for the datapusher generally.

wardi commented 9 years ago

I'm thinking of a simple extensible version of this:

  1. Raw/schemaless datastore import by adding support for postgres arrays to datastore field types
    • import complete XLS or CSV contents into a datastore table as a single column containing a postgres array of strings
    • guaranteed to work for any XLS or CSV
    • fast: use COPY in large batches of rows
    • collect and update minimal information while copying: number of columns, min/max float/int value seen for each column (or None if the column contains entries invalid for that type); this could happen entirely client-side (in the datapusher)
  2. datastore_alter: Alter table in postgres
    • copy data from raw array into real columns with types, apply indexes etc.
    • fast because it's all in postgres operating on data it already has (no looping in Python); a sketch of both steps follows this list
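
For concreteness, here's a minimal sketch (not datapusher code) of what those two steps could look like with direct psycopg2 access to the DataStore database. The connection string, table name and column list are invented, and identifier escaping is glossed over.

```python
# Sketch only: step 1 loads every CSV row into a single text[] column via COPY,
# step 2 materialises real typed columns entirely inside postgres.
import csv
import io
import psycopg2


def raw_import(conn, csv_path, table='_raw_import'):
    """Step 1: load every row into a single text[] column, no type guessing."""
    with conn.cursor() as cur:
        cur.execute('CREATE TABLE "{t}" (_row text[])'.format(t=table))
        buf = io.StringIO()
        out = csv.writer(buf)
        with open(csv_path, newline='') as f:
            for row in csv.reader(f):
                # Build a postgres array literal: {"a","b",...}
                elems = ','.join(
                    '"' + cell.replace('\\', '\\\\').replace('"', '\\"') + '"'
                    for cell in row)
                out.writerow(['{' + elems + '}'])
        buf.seek(0)
        # COPY one big batch; postgres does the parsing, not Python
        cur.copy_expert(
            'COPY "{t}" (_row) FROM STDIN WITH (FORMAT csv)'.format(t=table), buf)
    conn.commit()


def datastore_alter(conn, table, columns):
    """Step 2: copy the raw arrays into real typed columns, all in postgres."""
    selects = ', '.join(
        '_row[{i}]::{typ} AS "{name}"'.format(i=i + 1, typ=typ, name=name)
        for i, (name, typ) in enumerate(columns))
    with conn.cursor() as cur:
        cur.execute('CREATE TABLE "{t}_typed" AS SELECT {s} FROM "{t}"'
                    .format(t=table, s=selects))
    conn.commit()


conn = psycopg2.connect('dbname=datastore_default')  # hypothetical DSN
raw_import(conn, 'data.csv')
datastore_alter(conn, '_raw_import', [('name', 'text'), ('amount', 'numeric')])
```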

Building a nice interface to allow users to do fancy things like confirming column types or ETL to clean up data can happen after we have these low-level actions in place.

related tickets: #123 #124

jqnatividad commented 9 years ago

Thanks @rgrp, @wardi for your feedback.

This seems to be a recurring requirement as evidenced by the multiple references in this repo.

Combined Workflow

Maybe we can combine approaches, using pgloader.io instead of native COPY, while keeping the new workflow as close as possible to the current CKAN 2.3 upload workflow?

  1. Give me a CSV / Excel file
  2. Upload raw file to Filestore as is

no change to current CKAN workflow, so far

On the Manage/DataStore tab of the resource:

  1. I guess the schema (JSON Table Schema) and present it to you with some of your data in a nice way - e.g. I show you the table in a table structure with drop-downs for column types, the ability to add descriptions, etc.
  2. You edit and confirm the schema using a simple interface. The schema definition is stored in a "JSON Table Schema" resource extra field. (Aside: perhaps using PostgreSQL 9.4's new JSONB data type?)
  3. The schema definition is used to generate pgloader.io commands (a rough sketch is below).
  4. The import is done using pgloader.io.
  5. Based on the pgloader.io results, report issues, and optionally go back to step 2 if the load failed.
  6. If the pgloader.io load is successful, create a "Data Dictionary" resource view that renders the JSON Table Schema in a nice human-readable format.

only modifying the DataStore workflow
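
To make step 3 concrete, here is a rough sketch of generating a pgloader command file from a stored JSON Table Schema and running it. This is purely illustrative: the JTS-to-postgres type mapping is simplified, the file paths and table name are made up, and the exact pgloader DSL should be double-checked against its documentation.

```python
# Illustrative only: turn a (simplified) JSON Table Schema into a pgloader
# command file, then shell out to pgloader.
import subprocess
import tempfile

JTS_TO_PG = {'string': 'text', 'integer': 'integer', 'number': 'numeric',
             'boolean': 'boolean', 'date': 'date', 'datetime': 'timestamp'}


def pgloader_command(csv_path, pg_url, table, schema):
    # Map JTS fields to postgres column definitions (incomplete mapping)
    cols = ', '.join('"{}" {}'.format(f['name'], JTS_TO_PG.get(f.get('type'), 'text'))
                     for f in schema['fields'])
    return """
LOAD CSV
     FROM '{csv}'
     INTO {pg}?{table}
     WITH skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ','
BEFORE LOAD DO
     $$ CREATE TABLE IF NOT EXISTS "{table}" ({cols}); $$;
""".format(csv=csv_path, pg=pg_url, table=table, cols=cols)


schema = {'fields': [{'name': 'name', 'type': 'string'},
                     {'name': 'amount', 'type': 'number'}]}
cmd = pgloader_command('/tmp/data.csv', 'postgresql:///datastore_default',
                       'resource_abc123', schema)
with tempfile.NamedTemporaryFile('w', suffix='.load', delete=False) as f:
    f.write(cmd)
# Step 5 would parse pgloader's output/exit code to report issues back to the user
subprocess.run(['pgloader', f.name], check=True)
```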

In this way, we leverage not only Frictionless Data/Data Package, but we also use pgloader.io, which seems to be an excellent way to async-load data into postgres, even better than the native COPY command. I consciously aligned my user story to the current CKAN upload workflow, as I think it needs to be refactored insofar as the DataPusher/DataStore is concerned.

Descriptive Stats

And as @rgrp suggested, the descriptive stats stuff can come later, but once a dataset is loaded into Postgres, computing these stats is pretty straightforward with SQL (e.g. min, max, distinct, etc.). The JSON Table Schema field constraints can even be repurposed to show the stats, though they are not really "constraints" per se, but a description of the loaded table. I still see this as a related project to the work above, in a future iteration.
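
For example, a minimal sketch of the kind of per-column stats query this could run through CKAN's datastore_search_sql action (via ckanapi); the CKAN URL, API key, resource id and column name are placeholders:

```python
# Sketch only: compute simple per-column stats over a DataStore table.
import ckanapi

ckan = ckanapi.RemoteCKAN('https://ckan.example.org', apikey='MY-API-KEY')
resource_id = 'abc123'   # DataStore tables are named after the resource id
sql = '''
    SELECT MIN("amount") AS min_amount,
           MAX("amount") AS max_amount,
           COUNT(DISTINCT "amount") AS distinct_amounts,
           COUNT(*) AS row_count
    FROM "{}"
'''.format(resource_id)
result = ckan.action.datastore_search_sql(sql=sql)
print(result['records'][0])
```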

General Enrichment Services

An extensible framework would be great! Once we have a more robust way of onboarding data into the DataStore, having an enrichment framework would really go a long way towards enhancing CKAN as open infrastructure on top of which enterprise-class solutions can be built. #dogfooding :wink:

And I can envision a whole class of data enrichment services that the community can develop. This deserves its own ticket, but only makes sense once we have a more robust datapusher, especially since the enrichment services will actually require the JSON Table Schema.

So the JSON Table Schema is not only there to support the Data Dictionary view and the schema definition interface at load time; it will also support these enrichment services.

wardi commented 9 years ago

@jqnatividad I like it.

Can we integrate with pgloader safely via the command line, or is it going to take lots of fragile script generation and output parsing?

jqnatividad commented 9 years ago

@wardi, only one way to find out :smile: I only found out about pgloader.io this week, but it seems to be widely used and the author is a major PostgreSQL contributor.

As the current datapusher is a stand-alone CKAN service and runs asynchronously, I think it's a natural fit. The pgloader CLI options seem to be quite extensive, and maybe we can have a template DSL we can parameterize and create a next-gen datapusher using https://github.com/ckan/ckan-service-provider.

BTW, found this paper by the pgloader.io author that's quite interesting - http://rmod.lille.inria.fr/archives/dyla13/dyla13_3_Implementing_pgloader.pdf

jqnatividad commented 9 years ago

As a side benefit of using pgloader, we get to support additional file formats like fixed-width, Postgres COPY, dBase, IXF, and SQLite.

It's also interesting that pgloader can directly connect to MySQL and MS SQL Server using a connection string. For data publishers, this is a great way to direct-load data from transaction systems, skipping an intermediate ETL step.

Perhaps pgloader support in ckanapi and/or ckan-import as well? It could be a thin wrapper that directly uses the pgloader DSL and just uses the DataStore API to register the resource with CKAN and associate it with a package.
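
A very rough sketch of that thin-wrapper idea, assuming pgloader is on the PATH and using ckanapi for the CKAN side; all URLs and names are placeholders, and the step that actually tells the DataStore about the pre-loaded table is deliberately left out:

```python
# Sketch only: migrate a source database into the DataStore with pgloader,
# then register a resource in CKAN. Not a complete or validated workflow.
import subprocess
import ckanapi

SOURCE = 'mysql://reporting@db.example.org/sales'   # hypothetical source system
TARGET = 'postgresql:///datastore_default'          # DataStore database

# One-shot pgloader migration (table filtering, casts, etc. omitted)
subprocess.run(['pgloader', SOURCE, TARGET], check=True)

# Associate the loaded data with a CKAN package as a new resource
ckan = ckanapi.RemoteCKAN('https://ckan.example.org', apikey='MY-API-KEY')
resource = ckan.action.resource_create(
    package_id='sales-figures',
    name='Sales (loaded via pgloader)',
    url='',                 # placeholder; DataStore resources use url_type
    url_type='datastore',
    format='CSV')
# A real implementation would still need to rename the loaded table(s) to the
# resource id and mark the resource as datastore_active.
```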

maxious commented 9 years ago

Some interesting projects from GitHub:

  • https://github.com/opencivicdata/pupa is a Python library/DSL for scraping HTML pages in an ETL way
  • https://github.com/jdorn/json-editor is a Javascript library for generating UIs to edit JSON based on JSON Schema definitions

jqnatividad commented 9 years ago

And the pieces are starting to fall into place. json-editor looks really cool! @rgrp mentioned that schema editing is coming soon. Should we wait for that, or should we use/leverage json-editor instead?

rufuspollock commented 9 years ago

@jqnatividad i think we should still get a proper architecture diagram here and really agree how different pieces fit together so people can then go off and work on specific bits. As I said, I'd suggest using JSON Table Schema as a core piece which other bits can then work off. Also cc'ing @pwalsh as he is doing a lot of work on Data Package and Frictionless Data stuff with Open Knowledge atm.

pwalsh commented 9 years ago

Interesting discussion.

About schema editing - yes, we are working on one right now as a fairly generic interface to edit Data Package schemas, and we do use json-editor for that: https://github.com/okfn/datapackagist and http://datapackagist.okfnlabs.org/ - we have some UX work to do there, but functionality-wise there is already a quite complete application. It is still WIP so YMMV.

As it might not be immediately apparent from the DataPackagist UI, there are additional things here beyond schema editing as such:

For JSON Table Schema, https://github.com/okfn/json-table-schema-js and https://github.com/okfn/json-table-schema-py (jtskit on PyPI, as we only took over the json-table-schema package last week) are libs for handling JTS, including inferring a schema from data. These are more in line with the current JTS spec than, say, the JTS features in messytables.

Also, it might be interesting to note here the differences between GoodTables and MessyTables.

MessyTables tries to work with anything you throw at it, which is a great approach when you are working with known messy data and want to fix it in an automated fashion, or when you don't care about the quality of the actual source file as long as you can get data out of it.

GoodTables takes a different approach, where we do want to ensure that the source data is well formed, and if it is not, we want specific notifications about where and how it is not, so we can take action to fix the source. GoodTables is implemented as a processing pipeline and has two processors at present: one for structural validation, and one for validation against a JSON Table Schema. It is possible to hook in custom processors, although admittedly the docs are lacking in that department.

jqnatividad commented 9 years ago

Sorry, life happened (vacation, catch-up, work) and I got disconnected.

Would love to pick this up again, and as @rgrp suggested, go about speccing it out so folks can start work.

Will take a stab at a rough spec on GDocs and link it here so we can collaborate on it.

jqnatividad commented 9 years ago

On 16 July 2015, the CSV on the Web Working Group published Candidate Recommendations for Tabular Data on the Web.

How should this inform the proposed work?

rufuspollock commented 9 years ago

I don't think this changes things - it was close to (and based on) Tabular Data Package, and the key thing we probably want here is the JSON Table Schema part, which we have factored out.

jqnatividad commented 8 years ago

Touching this issue to highlight https://github.com/timwis/csv-schema from @timwis.

https://github.com/timwis/csv-schema/issues/4

davidread commented 7 years ago

Here's a variant that @wardi and I are looking at, to tackle problems getting the column types right during import with DataPusher:

  1. Detect CSV properties (encoding, CSV dialect, number of header lines, columns renamed in the datastore, column types) using messytables. Store most of these as resource extras (the column types will be stored in postgres in the next step).
  2. Use Postgres' COPY command to quickly import the CSV into the DataStore. Avoid value type errors by importing all columns as VARCHAR at this point.
  3. Try to convert each column to the preferred type. If a column fails, it fails, but at least the data is in the DataStore (see the sketch after this list).
  4. The DataStore is told about the new table in its db and so it is made available to users.
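
A rough sketch of steps 1 and 3 (not the eventual extension code), assuming messytables for the guessing and direct psycopg2 access for the per-column conversion; the type mapping, table name and connection string are invented, and step 2's COPY-as-VARCHAR load is assumed to have already happened:

```python
# Sketch only: guess headers/types with messytables, then cast each column in
# postgres with a per-column fallback so a bad column never blocks the import.
import messytables
import psycopg2


def detect_csv(path):
    """Step 1: guess the header row and column types with messytables."""
    with open(path, 'rb') as f:
        table_set = messytables.CSVTableSet(f)
        row_set = table_set.tables[0]
        offset, headers = messytables.headers_guess(row_set.sample)
        row_set.register_processor(messytables.headers_processor(headers))
        row_set.register_processor(messytables.offset_processor(offset + 1))
        types = messytables.type_guess(row_set.sample, strict=True)
    return headers, types


TYPE_MAP = {  # messytables type class -> postgres type (simplified)
    'IntegerType': 'integer', 'DecimalType': 'numeric',
    'DateUtilType': 'timestamp', 'BoolType': 'boolean',
}


def convert_columns(conn, table, headers, types):
    """Step 3: try casting each column to its guessed type; on failure,
    roll back and leave that column as text so the data stays available."""
    for name, mt_type in zip(headers, types):
        pg_type = TYPE_MAP.get(type(mt_type).__name__, 'text')
        if pg_type == 'text':
            continue
        try:
            with conn.cursor() as cur:
                cur.execute(
                    'ALTER TABLE "{t}" ALTER COLUMN "{c}" TYPE {p} '
                    'USING "{c}"::{p}'.format(t=table, c=name, p=pg_type))
            conn.commit()
        except psycopg2.DataError:
            conn.rollback()


conn = psycopg2.connect('dbname=datastore_default')  # hypothetical DSN
headers, types = detect_csv('data.csv')
convert_columns(conn, 'resource_abc123', headers, types)
```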

This would be done in a new extension, a replacement for ckanext-datapusher. The jobs would be run on a queue (ckanext-rq). I believe the last step will need a new action function in DataStore, a replacement for datastore_create but without actually storing the data.

Admins can look at the resource's DataStore tab and now, along with seeing the status & log of the import, they can also fix things like the encoding and column types (e.g. phone numbers need to be strings to keep the leading zero) and trigger a reimport.

jqnatividad commented 6 years ago

Closing this now that @davidread wrote Express Loader (https://github.com/davidread/ckanext-xloader) :)