frictionlessdata / pilot-dm4t

Pilot project with DM4T
http://www.cs.bath.ac.uk/dm4t/index.shtml

Ship Data Package scripts as Data Package Pipelines #21

Open danfowler opened 7 years ago

danfowler commented 7 years ago

As part of https://github.com/frictionlessdata/pilot-dm4t/issues/20, we would like to provide access to these datasets on a flat file data store. We can do this while simultaneously demonstrating how to use Data Package tooling. In this case, we will use Data Package Pipelines to create processing scripts for enriching and transporting these datasets from their source formats to a Data Packaged set of CSV files.
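
For anyone picking this up, here is a rough sketch of what "a Data Packaged set of CSV files" amounts to: a `datapackage.json` descriptor sitting next to the data and listing each CSV as a resource. It uses only the Python standard library. The function names `build_descriptor` and `write_descriptor` are hypothetical, and every field is typed as `string` as a placeholder; this is not the pipeline used in this repo.

```python
import csv
import json
from pathlib import Path


def build_descriptor(name, data_dir):
    """Return a minimal Data Package descriptor for the CSV files in data_dir."""
    data_dir = Path(data_dir)
    resources = []
    for csv_path in sorted(data_dir.glob("*.csv")):
        # Read only the header row to list the column names.
        with csv_path.open(newline="") as f:
            header = next(csv.reader(f), [])
        resources.append({
            "name": csv_path.stem.lower(),
            # Paths are relative to the directory that will hold datapackage.json.
            "path": f"{data_dir.name}/{csv_path.name}",
            "schema": {
                # Placeholder types: real types need a look at the data itself.
                "fields": [{"name": col, "type": "string"} for col in header],
            },
        })
    return {"name": name, "resources": resources}


def write_descriptor(descriptor, target_dir):
    """Write the descriptor as datapackage.json inside target_dir."""
    path = Path(target_dir) / "datapackage.json"
    path.write_text(json.dumps(descriptor, indent=2))
    return path
```

Calling `build_descriptor("enliten", "enliten/data")` and passing the result to `write_descriptor` would give a layout like `data/*.csv` plus `datapackage.json`, assuming the CSVs all carry a header row.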

danfowler commented 7 years ago

Update:

$ du -sh *
8.0G    enliten
4.2G    refit-cleaned

.
├── README.md
├── _custom_processors
│   ├── dump
│   │   └── to_aws.py
│   └── update
│       └── modify_descriptions.py
├── _datasets
│   ├── enliten
│   │   ├── data
│   │   │   └── *.csv
│   │   └── datapackage.json
│   └── refit-cleaned
│       └── data
│           └── *.csv
├── apatsche
│   └── create_datapackage.ipynb
├── enliten
│   ├── Makefile
│   ├── archive
│   │   ├── enliten-tsv.zip
│   │   ├── enliten.7z
│   │   └── enliten.sql
│   ├── enliten.ipynb
│   ├── pipeline-spec.yaml
│   └── scripts
├── refit-cleaned
│   ├── Makefile
│   ├── README.md
│   ├── Rplots.pdf
│   ├── archive
│   │   └── *.csv
│   ├── create_datapackage.ipynb
│   ├── example.R
│   ├── metadata.yml
│   ├── pipeline-spec.yaml
│   └── requirements.txt
└── requirements.txt
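
To give a feel for what a step like `modify_descriptions.py` might do, here is a small sketch of the core logic: filling in `description` on schema fields of a descriptor. This is an assumption about the intent, not the contents of the actual script. In a Data Package Pipelines processor, the descriptor would normally arrive via `ingest()` and be passed on via `spew()` from `datapackage_pipelines.wrapper`; that wiring is left out so the function runs without the library installed. The `descriptions` mapping is hypothetical.

```python
import copy


def modify_descriptions(datapackage, descriptions):
    """Return a copy of the descriptor with field descriptions filled in.

    `descriptions` maps a field name to its description text (hypothetical input).
    """
    # Work on a copy so the caller's descriptor is left untouched.
    updated = copy.deepcopy(datapackage)
    for resource in updated.get("resources", []):
        for field in resource.get("schema", {}).get("fields", []):
            # Only fields named in the mapping are changed.
            if field["name"] in descriptions:
                field["description"] = descriptions[field["name"]]
    return updated
```

Keeping the change as a pure function over the descriptor makes it easy to test outside a pipeline run before wiring it into `pipeline-spec.yaml`.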

@akariv @jobarratt @pwalsh @cblop

pwalsh commented 6 years ago

There is now a comprehensive, road-tested AWS library for Data Package Pipelines: https://github.com/frictionlessdata/datapackage-pipelines-aws

pwalsh commented 6 years ago

@jobarratt Dan's last notes here are a good starting point for whoever will pick this up.

Clear tasks are to:

This closes off steps 1 and 2 of the deliverables I listed here https://github.com/frictionlessdata/pilot-dm4t/issues/20