ckan / ckanext-validation

CKAN extension for validating Data Packages using Table Schema.

Overview (move to the docs?) #1

Open amercader opened 7 years ago

amercader commented 7 years ago

(Originally posted in https://github.com/frictionlessdata/project/issues/356)

Data validation integration in CKAN

Introduction and prior work

Goals:

  1. Ensure that published data in selected formats (CSV / Excel) is properly structured
  2. Ensure that selected published data validates against pre-defined schemas
  3. Allow to define a new schema for a data file and validate against it
  4. Make the DataStore import process more robust by providing a schema (rather than guess it)

To achieve these goals we aim to build mostly on top of specifications and tooling from the Frictionless Data project, a collection of standards and libraries that make it easier to get, share and validate data.

More specifically, the following components are likely to have a central role in the implementation:

Of course we want to build on existing work and align with other recent developments, especially on the DataStore front, like data dictionaries.

This is an interface that in a sense already allows creating a sort of schema to describe the data in the DataStore:

[data_dictionary screenshot]

This metadata is currently stored in Postgres itself, but we believe this can be done in parallel to the approach we are proposing (storing a standard Table Schema in the resource itself, see point 1). Also, the current interface is built with information from data already stored in the DataStore, and we are suggesting that this information can be entered by users at dataset creation time (see points 6 and 8). We hope to use the current interface as a starting point to extend and improve it to cover data types and make it more user friendly.

Proposed stages

The goals mentioned before can be achieved progressively. We propose a gradual approach, starting from a low level and building towards more user-facing features. Note that each stage is not necessarily dependent on the previous one (although some are).

1. Store data schema against a resource

Resources will have an optional schema field (adding this, along with schema_type and schema_version, to the resource model has been discussed previously).

Unless stated otherwise on schema_type, the schema will be a Table Schema.

Storing the data schema in the metadata makes sense as this will be exposed via the API (package_show) so other tools integrating with CKAN will be able to know beforehand the structure of the data.
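For illustration only, a resource as returned by package_show could then look roughly like this (all values below are made up):

```python
# Sketch of a resource dict with the proposed fields (values are illustrative).
resource = {
    "name": "spending-2017.csv",
    "format": "CSV",
    "url": "https://example.com/spending-2017.csv",
    "schema": {
        "fields": [
            {"name": "date", "type": "date"},
            {"name": "amount", "type": "number"},
            {"name": "supplier", "type": "string"},
        ]
    },
    "schema_type": "table-schema",
    "schema_version": "1.0",
}
```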

Estimate: 1 day

2. Action functions to validate against a resource

At the heart of the validation process there is a simple call to the goodtables library, passing the path or URL to a tabular file and optionally a schema (and some options if necessary).
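As a sketch, assuming the `validate` entry point of recent goodtables versions (older releases expose an `Inspector` class instead, so the exact call may differ):

```python
from goodtables import validate

# Validate a tabular file (path or URL), optionally against a Table Schema
# (e.g. the one stored in the resource `schema` field, if any).
report = validate(
    'https://example.com/spending-2017.csv',
    schema={'fields': [{'name': 'date', 'type': 'date'},
                       {'name': 'amount', 'type': 'number'}]},
)

# The report is a JSON-serializable dict with a top-level `valid` flag
# and per-table error details.
print(report['valid'])
```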

The validation process could potentially be a long one, so it should be performed asynchronously.

There are two options for performing the actual validation check:

  1. Run goodtables locally using the new background jobs.
  2. Use the external goodtables.io API to create a remote validation job.

Both approaches have advantages and disadvantages. While using the already available goodtables.io API might save time in building all the background jobs, model, etc., it also presents challenges like private datasets, instances behind a firewall...

In both cases the end result is a report object that can be stored in the database against the resource id, and externally the API actions should be the same:

    resource_validation_run
        * Params: resource_id
        * Returns: A job id that can be used to query the status with resource_validation_show

    resource_validation_show
        * Params: job_id
        * Returns: a validation report

Or if we want to keep things simple and just store the most recent validation job report:

    resource_validation_run
        * Params: resource_id
        * Returns: True if all went well and the validation job was started

    resource_validation_show
        * Params: resource_id
        * Returns: the last validation report (if available)

resource_validation_run just fires a validation background job (after doing all the necessary checks, like whether the resource exists, its format, etc.) and creates a new entry in the database table to show there is a pending job. The validation runs in the background and, once it finishes, updates the database with the report.

resource_validation_show queries the database to see if there is a pending or finished job for that particular resource and returns the report.

The validation will make use of a schema if there is one stored in the resource schema field.
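From a client's perspective, and assuming the simpler variant above, usage of these (proposed, not yet implemented) actions could look like this with ckanapi:

```python
import ckanapi

ckan = ckanapi.RemoteCKAN('https://ckan.example.com', apikey='xxx')

# Queue a validation job for a resource (returns once the job is enqueued).
ckan.action.resource_validation_run(resource_id='the-resource-id')

# Later, fetch the most recent report for that resource (may still be pending).
report = ckan.action.resource_validation_show(resource_id='the-resource-id')
```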

The database model will be a very simple validation table keyed by resource id, storing the job status and the resulting report.
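A minimal SQLAlchemy sketch of such a model, with assumed field names (the actual fields may well differ):

```python
import datetime

from sqlalchemy import Column, DateTime, Integer, Text
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Validation(Base):
    """Latest validation job and report for a resource (assumed fields)."""
    __tablename__ = 'validation'

    id = Column(Integer, primary_key=True)
    resource_id = Column(Text, index=True)
    status = Column(Text, default='pending')  # e.g. pending / running / success / failure
    created = Column(DateTime, default=datetime.datetime.utcnow)
    finished = Column(DateTime)
    report = Column(JSONB)  # the goodtables report object
```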

Estimate: 5 days

3. Automatic validation on resource update (file, URL or schema)

Once the actions and background job are in place it will be very easy to run the validation automatically on resource creation and updates by implementing IResourceController's after_create / after_update hooks.

When updating, validation should only be triggered if the data source (file upload or URL) or the schema changes. This might be difficult, as we will need to compare against the previous version of the resource.

To make it easier to link resources to validation reports, we can store validation_result and validation_date extras in the resource itself (and then the report can be accessed via resource_validation_show()).
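A rough sketch of the IResourceController hooks, assuming the resource_validation_run action from point 2 is available (the check for whether the source or schema actually changed is left out here):

```python
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit


class ValidationPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IResourceController, inherit=True)

    def after_create(self, context, resource):
        self._maybe_validate(resource)

    def after_update(self, context, resource):
        # Ideally only re-validate if the upload, URL or schema changed,
        # which requires comparing against the previous resource version.
        self._maybe_validate(resource)

    def _maybe_validate(self, resource):
        if resource.get('format', '').lower() in ('csv', 'xls', 'xlsx'):
            toolkit.get_action('resource_validation_run')(
                {'ignore_auth': True}, {'resource_id': resource['id']})
```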

Estimate: 1 day

4. Render validation report on the UI

Basically the idea is to display in a direct and intuitive way whether a tabular resource is valid or not, from the resource page (or even the listing of resources in the dataset page). From there we will link to a fully rendered report with all the validation errors, including error descriptions and a preview of the offending rows if relevant.

A simple first step could be to display valid or invalid badges on the resource pages, and make them clickable.

[badges screenshot]

This will link to a dedicated page with the validation report rendered. There are readily available components that render a JSON report object into HTML, using various common frontend frameworks like React, Angular or Vue. It would be just a matter of adding a new endpoint (/dataset/{id}/resource/{id}/validation) that queries the last report (if any) and renders it with the default CKAN theme.

Example page rendered from a report JSON object
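Depending on the CKAN version, the new endpoint could be wired up with IRoutes or a Flask blueprint; a blueprint-style sketch (the template name is made up):

```python
from flask import Blueprint

import ckan.plugins.toolkit as toolkit

validation = Blueprint('validation', __name__)


@validation.route('/dataset/<id>/resource/<resource_id>/validation')
def resource_validation(id, resource_id):
    # Fetch the latest report (if any) and render it with the default theme.
    report = toolkit.get_action('resource_validation_show')(
        {}, {'resource_id': resource_id})
    return toolkit.render(
        'validation/report.html',
        extra_vars={'dataset_id': id, 'resource_id': resource_id,
                    'report': report})
```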

This same page could have a control to re-run the validation (available to users that can edit the resource).

Estimate: 3 days

5. Make the DataPusher / DataStore Table Schema aware

The end goal is that data is properly stored in the DataStore using the correct field types (and even constraints).

For all the issues it causes, DataPusher is still the main way in which data gets into the DataStore. Implementation details aside and knowing there are probably better alternatives being proposed, the current DataPusher app is conceptually simple: it reads the data from the tabular file and uploads it in chunks to the DataStore via the API.

One problematic aspect of it is that in many cases the upload process fails because of badly defined field types for the actual data. This is because DataPusher uses a library called messytables to guess the field types based on a sample of the data.

Guessing is always going to be error-prone, even if we switch to a different library. So even if it isn't always possible, we should encourage publishers to describe their data, storing the field definitions in the schema field of the resource (see point 1). Previous discussion.

Regardless of how this schema object was stored in the resource (see point 6), if DataPusher gets a resource that has a schema, and this includes the field types, it will use it to create the field definitions that datastore_create supports, rather than guess them (guessing can still be the default behaviour if no schema is provided).

Table Schema is the perfect specification to use to define these fields: it's a lightweight, widely supported standard for describing tabular data. The mapping between the field types it offers and the Postgres-centric ones that the DataStore currently supports is also straightforward.

| DataStore field type | Table Schema field type |
|----------------------|-------------------------|
| text                 | string                  |
| float                | number                  |
| int                  | integer                 |
| bool                 | boolean                 |
| date                 | date                    |
| time                 | time                    |
| timestamp            | datetime                |
| json                 | object                  |

Other complex types in Table Schema (duration, geopoint, geojson, ...) could default to text or object.
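As a sketch, translating a Table Schema fields list into the fields list that datastore_create accepts could be as simple as this (type names as per the table above):

```python
# Table Schema field type -> DataStore field type, per the table above.
# Unsupported complex types fall back to text.
TYPE_MAPPING = {
    'string': 'text',
    'number': 'float',
    'integer': 'int',
    'boolean': 'bool',
    'date': 'date',
    'time': 'time',
    'datetime': 'timestamp',
    'object': 'json',
}


def table_schema_to_datastore_fields(schema):
    """Build the `fields` list expected by datastore_create from a Table Schema."""
    return [
        {'id': field['name'],
         'type': TYPE_MAPPING.get(field.get('type', 'string'), 'text')}
        for field in schema.get('fields', [])
    ]
```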

We have two options as to where this mapping from Table Schema to Datastore fields is done:

  1. Mapping done in DataPusher: if there is a schema, DataPusher translates the field definitions to what the DataStore currently understands.
  2. Mapping done in the DataStore: apart from the current fields definition (kept for compatibility), the DataStore supports Table Schema directly and each backend translates it internally to the relevant field types. DataPusher sends the schema from the resource to the DataStore if present.

My vote goes for 2 as it is more generic and it would allow connecting the DataStore to a wider ecosystem of tools that use Table Schema.

Estimate: 2 days

6. Assign a schema on dataset creation / update

Up until now we've been mentioning low-level modules that take advantage of resources having a schema attached to them, but of course this schema needs to be provided somewhere.

The first basic step is to be able to provide a schema on the resource create/update form. The end result would be a schema object (or a URL pointing to one?) in the schema field of the resource, but the interface could be:

  1. A simple text box to paste a schema object, or a file upload / URL field for providing an external one
  2. Select a pre-defined schema stored in the database (See point 7)
  3. Schema editor (See point 8)

Estimate: 0.5 days (option 1)

7. Schema store

In some cases, especially on large CKAN instances, schemas may be reused when publishing different datasets across organizations, for instance spending data that must adhere to a predefined standard.

A simple interface could be provided to add new schemas (using the same methods as point 6), and store them with some basic metadata like a user friendly title.

Schemas would be validated to ensure they have the correct syntax and stored in a dedicated table in the database.
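The syntax check could lean on the tableschema library (a sketch, assuming its Schema class exposes valid / errors as in recent versions):

```python
from tableschema import Schema


def is_valid_schema(descriptor):
    """Return True if `descriptor` is a syntactically valid Table Schema."""
    schema = Schema(descriptor)
    return schema.valid  # schema.errors holds the details when invalid
```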

On the resource form, users could then choose from the existing schemas when creating the resource. The actual schema object would be stored on the resource, or alternatively a publicly accessible endpoint for the relevant schema (eg /schemas/data/{id}) could be referenced.

In the future this same interface could be extended to manage metadata schemas (when ckanext-scheming is integrated into core).

Estimate: 3-6 days (needs proper spec)

8. Schema editor based on the actual fields

Providing an existing schema that has been created elsewhere, probably by hand, is a great feature, but one that will probably only be used by power users or data-savvy publishers that are familiar with validation, standards, etc.

To really engage publishers in the description of their data we need to guide the creation of these schemas based on the actual contents of the file, which is what people are familiar with.

In a nutshell, when uploading or linking to a new file, the user gets a list of the existing fields, with an option to define the type of that field (a guessed one is provided for them). Additionally they can provide extra information about the field like user-friendly labels or a description.

This gets transformed into a Table Schema internally that gets stored in the schema field.

This pattern is well established (see eg Socrata); the challenge is how to integrate it into the existing workflow in CKAN for creating a dataset.

If the user uploads the file, it seems like the best experience would be to read the file in the browser using tableschema-js and infer a schema that would be used to generate the edit interface. This tool shows what the general interaction would look like (and perhaps we can reuse part of it):

https://csv-schema.surge.sh/
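A server-side equivalent of that inference step could use tableschema-py (a sketch; the exact API may vary between versions):

```python
from tableschema import Table

# Infer a starting Table Schema from the uploaded file; the user would then
# adjust types and add labels/descriptions before it is saved on the resource.
table = Table('spending-2017.csv')
table.infer()
print(table.schema.descriptor)
```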

Estimate: 4-6 days (needs proper spec)

9. Validate on dataset creation

What will really make a difference in terms of improving data quality will be the integration of validation into the process of adding new datasets.

Running the validation before the creation of the dataset will highlight errors early in the process and engage publishers in fixing those issues.

The actual workflow will depend on each CKAN instance and the publishers' needs: for instance, some might prevent the dataset from being created if there are errors, or allow the creation of the dataset but not upload the data to the DataStore, or in some cases just warn the user.

Regardless of how the process continues after the validation, the main implementation issue would be how to integrate an asynchronous validation job with the multi-step process now in place for creating datasets. The resource endpoints of the package controller are infamous for their complexity, caused by how this multi-stage process is implemented.

TODO: expand potential implementation and estimate

wardi commented 7 years ago

Sounds great. The type override field on the data dictionary should be hidden when there's a schema defining the types.

We should also avoid float and int types in the datastore because users might be surprised by their limitations. numeric would be a better choice.

amercader commented 7 years ago

@wardi great to hear that!

+1 to numeric

BTW, maybe I just imagined it, but I vaguely remember a recent comment from you on a pull request or issue mentioning a potential new IUploader interface that would allow storing files before creating the dataset/resource. I can't find it now so maybe I just dreamt it. In any case that would help with things like points 8 and 9, so I'd be curious to know your thoughts on this (if you really had any!)

roll commented 5 years ago

@amercader I guess it makes sense to move this to some documentation (e.g. the README)?