danfowler opened this issue 7 years ago
Update:

- I'm working on a branch using the directory structure listed at the bottom of this post. Briefly, all the datasets are dumped into a local directory called `_datasets`, and custom processors for datapackage-pipelines are stored in `_custom_processors`.
- Data Packages are pushed to a public S3 bucket, data.frictionlessdata.io (e.g. enliten).
- I wrote a very simple `.to_aws()` processor, which works well for smaller files.
- I wrote a very simple processor to update field descriptions, with the main goal of reducing the size of the YAML file for refit-cleaned (currently already 570 lines).
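For reference, a custom datapackage-pipelines processor of this kind can be tiny. The sketch below is illustrative only, not the actual `modify_descriptions.py` from the branch: it uses the standard `ingest`/`spew` wrapper API, and the `descriptions` parameter (a mapping of field name to description) is an assumed interface, not the real one.

```python
# Minimal sketch of a field-description-updating processor using the standard
# datapackage-pipelines wrapper API. The `descriptions` parameter is a
# hypothetical interface, not the actual modify_descriptions.py.
from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()

descriptions = parameters.get('descriptions', {})

# Walk every field in every resource and attach a description if one was
# provided for that field name.
for resource in datapackage.get('resources', []):
    for field in resource.get('schema', {}).get('fields', []):
        if field['name'] in descriptions:
            field['description'] = descriptions[field['name']]

# Pass the (now modified) datapackage and the untouched row streams onward.
spew(datapackage, resource_iterator)
```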
The files in these datasets are quite massive:
$ du -sh *
8.0G enliten
4.2G refit-cleaned
The Internet connections I have available to me are having trouble handling these massive S3 transfers (Cyberduck somewhat works, but `aws s3 cp --recursive` crashes my connection!).
The size of these datasets provides a good case for supporting data split across multiple files: http://specs.frictionlessdata.io/data-resource/#data-in-multiple-files
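For illustration, the spec allows a single resource to list several files under `path`, so a huge table could be chunked without changing its logical schema. The fragment below is hypothetical (shown as YAML for readability; the real `datapackage.json` is JSON), with made-up file and field names:

```yaml
# Hypothetical resource descriptor fragment splitting one large table
# into chunks; file and field names are invented for illustration.
name: power
profile: tabular-data-resource
path:
  - data/power-part1.csv
  - data/power-part2.csv
schema:
  fields:
    - {name: timestamp, type: datetime}
    - {name: reading, type: number}
```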
`enliten` is an SQL database with a set of core sensor data tables (`co2`, `gas`, `humidity`, `light`, `pir`, `power`, `sound`, `temperature`) and a large number of related tables documenting users, devices, models of devices, units, etc. While there is some support for tables with relations in the Frictionless Data suite, it doesn't seem like the dataset should be published "as is" with all of these relations in place. Rather, for this dataset, my strategy is to start with the core sensor tables from the SQL database, dump them to CSV files using datapackage-pipelines, enrich them with declared types, and then use the `join` processor from datapackage-pipelines to selectively join a few of the tables back together — for instance, retrieving the sensor model name using the sensor model `_id` and publishing that in the dataset on S3.
I would like the ability to drop columns that were used as the lookup key in a joined dataset, but I am not sure that's practical.
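To make the join strategy concrete, here is a rough sketch of what the relevant `pipeline-spec.yaml` steps could look like, using the standard `join` processor plus `delete_fields` (which may cover the column-dropping wish above). The resource and field names (`sensor_models`, `model_id`, etc.) are invented for illustration and won't match the actual enliten schema:

```yaml
# Hypothetical steps from a pipeline-spec.yaml; resource and field names are
# invented for illustration and will not match the real enliten schema.
- run: join
  parameters:
    source:
      name: sensor_models     # lookup table dumped from the SQL database
      key: ["id"]
      delete: true            # remove the lookup resource from the output
    target:
      name: power             # core sensor table being enriched
      key: ["model_id"]
    fields:
      model_name:
        name: name            # copy sensor_models.name in as "model_name"
- run: delete_fields          # drop the now-redundant lookup key column
  parameters:
    resources: power
    fields: ["model_id"]
```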
datapackage-pipelines takes a long time (10+ hours) to work through these datasets on my home machine.
Given my issues with working with these large datasets on my home machine, I wonder if I need to do this all on AWS 🤔.
.
├── README.md
├── _custom_processors
│  ├── dump
│  │  └── to_aws.py
│  └── update
│     └── modify_descriptions.py
├── _datasets
│  ├── enliten
│  │  ├── data
│  │  │  ├── *.csv
│  │  └── datapackage.json
│  └── refit-cleaned
│     └── data
│        ├── *.csv
├── apatsche
│  └── create_datapackage.ipynb
├── enliten
│  ├── Makefile
│  ├── archive
│  │  ├── enliten-tsv.zip
│  │  ├── enliten.7z
│  │  └── enliten.sql
│  ├── enliten.ipynb
│  ├── pipeline-spec.yaml
│  └── scripts
├── refit-cleaned
│  ├── Makefile
│  ├── README.md
│  ├── Rplots.pdf
│  ├── archive
│  │  ├── *.csv
│  ├── create_datapackage.ipynb
│  ├── example.R
│  ├── metadata.yml
│  ├── pipeline-spec.yaml
│  └── requirements.txt
└── requirements.txt
@akariv @jobarratt @pwalsh @cblop
There is now a comprehensive, road-tested AWS lib for Data Package Pipelines: https://github.com/frictionlessdata/datapackage-pipelines-aws
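As a sketch of how that might slot into a pipeline (the processor name and parameters here are assumptions from memory — check the library's README before relying on them):

```yaml
# Hypothetical pipeline-spec.yaml step using datapackage-pipelines-aws.
# The processor name and parameters are assumptions; verify against the README.
- run: aws.dump.to_s3
  parameters:
    bucket: data.frictionlessdata.io
    path: enliten
```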
@jobarratt Dan's last notes here are a good starting point for whoever will pick this up.
Clear tasks are to:
This closes off steps 1 and 2 of the deliverables I listed here https://github.com/frictionlessdata/pilot-dm4t/issues/20
As part of https://github.com/frictionlessdata/pilot-dm4t/issues/20, we would like to provide access to these datasets on a flat file data store. We can do this while simultaneously demonstrating how to use Data Package tooling. In this case, we will use Data Package Pipelines to create processing scripts for enriching and transporting these datasets from their source formats to a Data Packaged set of CSV files.
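As a rough illustration of the intended shape of such a pipeline (all names, paths, and types below are placeholders rather than the real dataset details), a `pipeline-spec.yaml` might chain a loading step, type enrichment, and a dump step:

```yaml
# Hypothetical pipeline-spec.yaml showing the general shape only: load a CSV
# already exported from the source database, declare types, and dump it as a
# Data Package. All names, paths, and types are placeholders.
enliten-example:
  pipeline:
    - run: add_resource
      parameters:
        name: power
        url: _datasets/enliten/data/power.csv
    - run: stream_remote_resources
    - run: set_types
      parameters:
        resources: power
        types:
          timestamp: {type: datetime}
          reading: {type: number}
    - run: dump.to_path
      parameters:
        out-path: _datasets/enliten
```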
- [ ] custom processors for all:
  - `.to_aws()` processor that dumps Data Package to S3 bucket
  - `.to_dpr()` processor that dumps Data Package to DPR API
- [ ] enliten
- [ ] refit-cleaned
- [ ] apatsche