frictionlessdata / pilot-dm4t

Pilot project with DM4T
http://www.cs.bath.ac.uk/dm4t/index.shtml

Delivery phase #20

Open pwalsh opened 7 years ago

pwalsh commented 7 years ago

Description

We want to close off a small, achievable, and meaningful pilot for the dm4t data. After internal discussion, we believe we can do so in a relatively short time.

Further, we think this will demonstrate some simple yet important and powerful steps for other pilots, and the community at large, in using Frictionless Data specifications and tooling to "progressively enhance" raw data - especially data like this which is the output of a much bigger research project.

There are four high-level outputs for this pilot:

  1. The data processing scripts as a Data Package Pipeline
  2. The Data Packaged data: stored on a flat file data store, providing a minimal, yet unrestricted, "API" to the data
  3. The ability to generate a search API over the data. This will be a single command that reads a collection of Data Packages (e.g. in an S3 bucket) and generates a search index using Elasticsearch
  4. A report outlining the details of the case study and the progressive enhancement approach afforded by the specifications and tooling: from a set of unrelated data sources, to a standardised filesystem for data access, to higher-level tools like a derived, immutable search index. We'd expand on the benefits of this approach, in particular the angles related to long-term data access, cost effectiveness, and the affordances of the flat-file "point of truth" with derived databases.
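To make output 2 concrete: the flat-file "API" is nothing more than the Data Package descriptor plus its resources served as plain files, so a consumer only needs to fetch and parse `datapackage.json`. A minimal, stdlib-only sketch (the bucket URL and resource names below are placeholders, not the real pilot data):

```python
import json

def resource_urls(descriptor: dict, base_url: str) -> list:
    """Resolve the resource paths in a Data Package descriptor against
    the flat-file store's base URL. No server-side logic is needed:
    the descriptor itself is the 'API'."""
    urls = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path")
        if path:
            urls.append(f"{base_url.rstrip('/')}/{path}")
    return urls

# A toy descriptor, standing in for one fetched from e.g. an S3 bucket
descriptor = json.loads("""
{
  "name": "refit-sample",
  "resources": [
    {"name": "readings", "path": "data/readings.csv"}
  ]
}
""")

print(resource_urls(descriptor, "https://example-bucket.s3.amazonaws.com/refit-sample"))
```

Any HTTP client (or `aws s3 cp`) can then fetch the resolved URLs directly, which is what makes the store "minimal, yet unrestricted".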

Tasks

@vitorbaptista can grant access for S3 and Elasticsearch service

pwalsh commented 7 years ago

cc @jobarratt

jobarratt commented 7 years ago

@danfowler as a first step, can you please estimate the time needed for these tasks?

cblop commented 7 years ago

@danfowler has already achieved outputs 1 and 2 for the REFIT data. Do you plan to implement outputs 1 - 4 for all of the datasets we have access to (Enliten, REFIT, Apatsche)?

We've also recently gained access to another dataset from Loughborough, confusingly also under the REFIT umbrella, but structured very differently from the REFIT data we already have. Do you want to run the pilot on that data as well?

danfowler commented 7 years ago

@cblop yes, I plan on doing this for all the datasets we have access to. (Do you have any insight on how to model Apatsche? https://github.com/frictionlessdata/pilot-dm4t/issues/17)

I could optionally redo REFIT very simply as a datapackage-pipeline, but that's probably not necessary if we adhere to the same flow as the others.
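For reference, a re-done REFIT pipeline would be a short `pipeline-spec.yaml` along these lines. This is only a sketch of the datapackage-pipelines format, not a tested spec: the URL, field names, and dump target are placeholders.

```yaml
refit:
  pipeline:
    - run: add_metadata
      parameters:
        name: refit
        title: REFIT electrical load measurements
    - run: add_resource
      parameters:
        name: readings
        url: https://example.org/refit/readings.csv   # placeholder URL
    - run: stream_remote_resources
    - run: set_types
      parameters:
        resources: readings
        types:
          timestamp:
            type: datetime
          watts:
            type: number
    - run: dump.to_path
      parameters:
        out-path: out/refit
```

The final `dump.to_path` step would be swapped for an S3-targeting dumper once one exists.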

Re: Loughborough, can you link to the dataset? I imagine if it is straightforward enough to package we can do it, but our aim is to close this off rather soon.

@pwalsh

  1. For shipping Data Packaging scripts, I would like to do this with datapackage-pipelines. I imagine I can write a simple .to_aws dumper (right, @akariv?) to package the whole process.
  2. If we are shipping to AWS, we can also hook up goodtables.io to exercise the service on large datasets. Data validation will happen through dpp too, though, as we are setting types.
  3. Single command for ES API generator: we can also see what is generated from @jcockhren
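On point 3: the core of such a generator is just a walk over each package's tabular resources, emitting Elasticsearch bulk-style actions. A rough, stdlib-only sketch, under the assumption of one index per data package (the actual indexing would go through the Elasticsearch client's bulk helper, which isn't shown here):

```python
import csv
import io

def rows_to_actions(package_name: str, resource_name: str, csv_text: str):
    """Turn one tabular resource into Elasticsearch bulk-style actions.
    Assumption: one index per data package; document ids are left to ES."""
    actions = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        actions.append({
            "_index": package_name,   # hypothetical index-naming scheme
            "_source": {"resource": resource_name, **row},
        })
    return actions

# Toy resource standing in for a CSV read from the flat-file store
sample = "timestamp,watts\n2014-01-01T00:00:00Z,120\n2014-01-01T00:01:00Z,118\n"
actions = rows_to_actions("refit-sample", "readings", sample)
print(len(actions), actions[0]["_index"])
```

Because the index is derived entirely from the flat files, it can be thrown away and rebuilt at any time, which is the "immutable search index" angle from the outputs list.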

@jobarratt estimates in Trello

pwalsh commented 6 years ago

We will skip Apatsche (see https://github.com/frictionlessdata/pilot-dm4t/issues/17)

pwalsh commented 6 years ago

@jobarratt

I've updated the task list in the first issue description.

Whoever takes this on needs to do #21 and then come back here to complete the rest of the tasks.