frictionlessdata / pilot-dm4t

Pilot project with DM4T
http://www.cs.bath.ac.uk/dm4t/index.shtml

Delivery phase #20

Open pwalsh opened 7 years ago

pwalsh commented 7 years ago

Description

We want to close off a small, achievable, and meaningful pilot for the dm4t data. After internal discussion, we believe we can do so in a relatively short time.

Further, we think this will demonstrate some simple yet important and powerful steps for other pilots, and the community at large, in using Frictionless Data specifications and tooling to "progressively enhance" raw data - especially data like this which is the output of a much bigger research project.

There are four high-level outputs for this pilot:

  1. The data processing scripts as a Data Package Pipeline
  2. The Data Packaged data: stored on a flat file data store, providing a minimal, yet unrestricted, "API" to the data
  3. The ability to generate a search API over the data. This will be a single command that reads a collection of Data Packages (e.g. in an S3 bucket) and generates a search index using Elasticsearch
  4. A report outlining the details of the case study and the progressive enhancement approach afforded by the specifications and tooling: from a set of unrelated data sources, to a standardised filesystem for data access, to higher-level tools like a derived, immutable search index. We'd expand on the benefits of this approach, in particular the angles related to long-term data access, cost effectiveness, and the affordances of the flat-file "point of truth" with derived databases.
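To make output 2 concrete: the flat-file "API" is nothing more than the Data Package descriptor plus its resources served as plain files, so a consumer only needs to fetch and parse `datapackage.json`. A minimal, stdlib-only sketch (the bucket URL and resource names below are placeholders, not the real pilot data):

```python
import json

def resource_urls(descriptor: dict, base_url: str) -> list:
    """Resolve the resource paths in a Data Package descriptor against
    the flat-file store's base URL. No server-side logic is needed:
    the descriptor itself is the 'API'."""
    urls = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path")
        if path:
            urls.append(f"{base_url.rstrip('/')}/{path}")
    return urls

# A toy descriptor, standing in for one fetched from e.g. an S3 bucket
descriptor = json.loads("""
{
  "name": "refit-sample",
  "resources": [
    {"name": "readings", "path": "data/readings.csv"}
  ]
}
""")

print(resource_urls(descriptor, "https://example-bucket.s3.amazonaws.com/refit-sample"))
```

Any HTTP client (or `aws s3 cp`) can then fetch the resolved URLs directly, which is what makes the store "minimal, yet unrestricted".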

Tasks

@vitorbaptista can grant access for S3 and Elasticsearch service

pwalsh commented 7 years ago

cc @jobarratt

jobarratt commented 7 years ago

@danfowler as a first step, can you please estimate the time needed for these tasks?

cblop commented 7 years ago

@danfowler has already achieved outputs 1 and 2 for the REFIT data. Do you plan to implement outputs 1 - 4 for all of the datasets we have access to (Enliten, REFIT, Apatsche)?

We've also recently gained access to another dataset from Loughborough, confusingly also under the REFIT umbrella, but structured very differently from the REFIT data we already have. Do you want to run the pilot on that data as well?

danfowler commented 7 years ago

@cblop yes, I plan on doing this for all the datasets we have access to. (Do you have any insight on how to model Apatsche? https://github.com/frictionlessdata/pilot-dm4t/issues/17)

I could optionally redo REFIT very simply as a datapackage-pipeline, but that's probably not necessary if we adhere to the same flow as the others.
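For reference, a re-done REFIT pipeline would be a short `pipeline-spec.yaml` along these lines. This is only a sketch of the datapackage-pipelines format, not a tested spec: the URL, field names, and dump target are placeholders.

```yaml
refit:
  pipeline:
    - run: add_metadata
      parameters:
        name: refit
        title: REFIT electrical load measurements
    - run: add_resource
      parameters:
        name: readings
        url: https://example.org/refit/readings.csv   # placeholder URL
    - run: stream_remote_resources
    - run: set_types
      parameters:
        resources: readings
        types:
          timestamp:
            type: datetime
          watts:
            type: number
    - run: dump.to_path
      parameters:
        out-path: out/refit
```

The final `dump.to_path` step would be swapped for an S3-targeting dumper once one exists.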

Re: Loughborough, can you link to the dataset? I imagine if it is straightforward enough to package we can do it, but our aim is to close this off rather soon.

@pwalsh

  1. For shipping Data Packaging scripts, I would like to do this with datapackage-pipelines. I imagine I can write a simple .to_aws dumper (right, @akariv?) to package the whole process.
  2. If we are shipping to AWS, we can also hook up goodtables.io to exercise the service on large datasets. Data validation will happen through dpp too, though, as we are setting types.
  3. Single command for ES API generator: we can also see what is generated from @jcockhren
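On point 3: the core of such a generator is just a walk over each package's tabular resources, emitting Elasticsearch bulk-style actions. A rough, stdlib-only sketch, under the assumption of one index per data package (the actual indexing would go through the Elasticsearch client's bulk helper, which isn't shown here):

```python
import csv
import io

def rows_to_actions(package_name: str, resource_name: str, csv_text: str):
    """Turn one tabular resource into Elasticsearch bulk-style actions.
    Assumption: one index per data package; document ids are left to ES."""
    actions = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        actions.append({
            "_index": package_name,   # hypothetical index-naming scheme
            "_source": {"resource": resource_name, **row},
        })
    return actions

# Toy resource standing in for a CSV read from the flat-file store
sample = "timestamp,watts\n2014-01-01T00:00:00Z,120\n2014-01-01T00:01:00Z,118\n"
actions = rows_to_actions("refit-sample", "readings", sample)
print(len(actions), actions[0]["_index"])
```

Because the index is derived entirely from the flat files, it can be thrown away and rebuilt at any time, which is the "immutable search index" angle from the outputs list.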

@jobarratt estimates in Trello

pwalsh commented 6 years ago

We will skip Apatsche (see https://github.com/frictionlessdata/pilot-dm4t/issues/17)

pwalsh commented 6 years ago

@jobarratt

I've updated the task list in the first issue description.

Whoever takes this on needs to do #21 and then come back here to complete the rest of the tasks.