danfowler opened this issue 7 years ago
Update:

- I'm working on a branch using the directory structure listed at the bottom of this post. Briefly, all the datasets are dumped into a local directory called `_datasets`, and custom processors for datapackage-pipelines are stored in `_custom_processors`.
- Data Packages are pushed to a public S3 bucket, data.frictionlessdata.io (e.g. enliten).
- I wrote a very simple `.to_aws()` processor, which works well for smaller files.
- I wrote a very simple processor to update field descriptions, with the main goal of reducing the size of the YAML file for refit-cleaned (currently already 570 lines).
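For reference, a custom datapackage-pipelines processor of this kind can be tiny. The sketch below is illustrative only, not the actual `modify_descriptions.py` from the branch: it uses the standard `ingest`/`spew` wrapper API, and the `descriptions` parameter (a mapping of field name to description) is an assumed interface, not the real one.

```python
# Minimal sketch of a field-description-updating processor using the standard
# datapackage-pipelines wrapper API. The `descriptions` parameter is a
# hypothetical interface, not the actual modify_descriptions.py.
from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()

descriptions = parameters.get('descriptions', {})

# Walk every field in every resource and attach a description if one was
# provided for that field name.
for resource in datapackage.get('resources', []):
    for field in resource.get('schema', {}).get('fields', []):
        if field['name'] in descriptions:
            field['description'] = descriptions[field['name']]

# Pass the (now modified) datapackage and the untouched row streams onward.
spew(datapackage, resource_iterator)
```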
The files in these datasets are quite massive:
$ du -sh *
8.0G enliten
4.2G refit-cleaned
The Internet connections I have available to me are having trouble handling these massive S3 transfers (Cyberduck somewhat works, but `aws s3 cp --recursive` crashes my connection!).
The size of these datasets provides a good case for supporting data split across multiple files: http://specs.frictionlessdata.io/data-resource/#data-in-multiple-files
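For illustration, the spec allows a single resource to list several files under `path`, so a huge table could be chunked without changing its logical schema. The fragment below is hypothetical (shown as YAML for readability; the real `datapackage.json` is JSON), with made-up file and field names:

```yaml
# Hypothetical resource descriptor fragment splitting one large table
# into chunks; file and field names are invented for illustration.
name: power
profile: tabular-data-resource
path:
  - data/power-part1.csv
  - data/power-part2.csv
schema:
  fields:
    - {name: timestamp, type: datetime}
    - {name: reading, type: number}
```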
`enliten` is an SQL database with a set of core sensor data tables (`co2`, `gas`, `humidity`, `light`, `pir`, `power`, `sound`, `temperature`) and a large number of related tables documenting users, devices, models of devices, units, etc. While there is some support for tables with relations in the Frictionless Data suite, it doesn't seem like the dataset should be published "as is" with all of these relations in place. Rather, for this dataset, my strategy is to start with the core sensor tables from the SQL database, dump them to CSV files using datapackage-pipelines, enrich them with declared types, and then use the `join` processor from datapackage-pipelines to selectively join a few of the tables back together — for instance, retrieving the sensor model name using the sensor model `_id` and publishing that in the dataset on S3.
I would like the ability to drop columns that were used as the lookup key in a joined dataset, but I am not sure that's practical.
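To make the join strategy concrete, here is a rough sketch of what the relevant `pipeline-spec.yaml` steps could look like, using the standard `join` processor plus `delete_fields` (which may cover the column-dropping wish above). The resource and field names (`sensor_models`, `model_id`, etc.) are invented for illustration and won't match the actual enliten schema:

```yaml
# Hypothetical steps from a pipeline-spec.yaml; resource and field names are
# invented for illustration and will not match the real enliten schema.
- run: join
  parameters:
    source:
      name: sensor_models     # lookup table dumped from the SQL database
      key: ["id"]
      delete: true            # remove the lookup resource from the output
    target:
      name: power             # core sensor table being enriched
      key: ["model_id"]
    fields:
      model_name:
        name: name            # copy sensor_models.name in as "model_name"
- run: delete_fields          # drop the now-redundant lookup key column
  parameters:
    resources: power
    fields: ["model_id"]
```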
datapackage-pipelines takes a long time (10+ hours) to work through these datasets on my home machine.
Given my issues with working with these large datasets on my home machine, I wonder if I need to do this all on AWS 🤔.
.
├── README.md
├── _custom_processors
│  ├── dump
│  │  └── to_aws.py
│  └── update
│     └── modify_descriptions.py
├── _datasets
│  ├── enliten
│  │  ├── data
│  │  │  ├── *.csv
│  │  └── datapackage.json
│  └── refit-cleaned
│     └── data
│        ├── *.csv
├── apatsche
│  └── create_datapackage.ipynb
├── enliten
│  ├── Makefile
│  ├── archive
│  │  ├── enliten-tsv.zip
│  │  ├── enliten.7z
│  │  └── enliten.sql
│  ├── enliten.ipynb
│  ├── pipeline-spec.yaml
│  └── scripts
├── refit-cleaned
│  ├── Makefile
│  ├── README.md
│  ├── Rplots.pdf
│  ├── archive
│  │  ├── *.csv
│  ├── create_datapackage.ipynb
│  ├── example.R
│  ├── metadata.yml
│  ├── pipeline-spec.yaml
│  └── requirements.txt
└── requirements.txt
@akariv @jobarratt @pwalsh @cblop
There is now a comprehensive, road-tested AWS lib for Data Package Pipelines: https://github.com/frictionlessdata/datapackage-pipelines-aws
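As a sketch of how that might slot into a pipeline (the processor name and parameters here are assumptions from memory — check the library's README before relying on them):

```yaml
# Hypothetical pipeline-spec.yaml step using datapackage-pipelines-aws.
# The processor name and parameters are assumptions; verify against the README.
- run: aws.dump.to_s3
  parameters:
    bucket: data.frictionlessdata.io
    path: enliten
```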
@jobarratt Dan's last notes here are a good starting point for whoever will pick this up.
Clear tasks are to:
This closes off steps 1 and 2 of the deliverables I listed here https://github.com/frictionlessdata/pilot-dm4t/issues/20
As part of https://github.com/frictionlessdata/pilot-dm4t/issues/20, we would like to provide access to these datasets on a flat file data store. We can do this while simultaneously demonstrating how to use Data Package tooling. In this case, we will use Data Package Pipelines to create processing scripts for enriching and transporting these datasets from their source formats to a Data Packaged set of CSV files.
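As a rough illustration of the intended shape of such a pipeline (all names, paths, and types below are placeholders rather than the real dataset details), a `pipeline-spec.yaml` might chain a loading step, type enrichment, and a dump step:

```yaml
# Hypothetical pipeline-spec.yaml showing the general shape only: load a CSV
# already exported from the source database, declare types, and dump it as a
# Data Package. All names, paths, and types are placeholders.
enliten-example:
  pipeline:
    - run: add_resource
      parameters:
        name: power
        url: _datasets/enliten/data/power.csv
    - run: stream_remote_resources
    - run: set_types
      parameters:
        resources: power
        types:
          timestamp: {type: datetime}
          reading: {type: number}
    - run: dump.to_path
      parameters:
        out-path: _datasets/enliten
```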
- [ ] custom processors for all:
  - `.to_aws()` processor that dumps Data Package to S3 bucket
  - `.to_dpr()` processor that dumps Data Package to DPR API
- [ ] enliten
- [ ] refit-cleaned
- [ ] apatsche