datahubio / datahub-v2-pm

Project management (issues only)

[Epic] Dataset automation #78

Closed zelima closed 6 years ago

zelima commented 6 years ago

Dataset automation - Dec 2017

As a user, I want to un-pivot and normalize my remote data, so that I can package it easily and possibly create graphs from it.
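Un-pivoting (melting) wide data into long, normalized rows could be sketched in pure Python like this; the `country`/year column names are hypothetical and just illustrate the shape of the transformation:

```python
def unpivot(rows, id_keys, var_name="year", value_name="value"):
    """Turn wide rows into long (normalized) rows."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key in id_keys:
                continue  # identifier columns are copied, not melted
            record = {k: row[k] for k in id_keys}
            record[var_name] = key
            record[value_name] = value
            long_rows.append(record)
    return long_rows

wide = [{"country": "GE", "2016": 10, "2017": 12}]
print(unpivot(wide, id_keys=["country"]))
# [{'country': 'GE', 'year': '2016', 'value': 10},
#  {'country': 'GE', 'year': '2017', 'value': 12}]
```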

As a user, I want to remove columns that I do not want to be in DataHub, so that I can present only the data relevant to my needs.

As a user, I want to add a new column that is derived from two or more other columns, so that I can calculate e.g. the total amount of money spent per country.
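Both of the previous two stories (dropping an unwanted column, adding a derived one) reduce to a simple row transform; a minimal sketch with made-up column names:

```python
rows = [
    {"country": "GE", "spent_q1": 5, "spent_q2": 7, "internal_id": 42},
]

cleaned = []
for row in rows:
    # Drop the column we do not want published.
    new_row = {k: v for k, v in row.items() if k != "internal_id"}
    # Add a column derived from two existing columns.
    new_row["total_spent"] = new_row["spent_q1"] + new_row["spent_q2"]
    cleaned.append(new_row)

print(cleaned[0]["total_spent"])  # 12
```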

As a user, I want to remove the last row from an Excel file, so that it is valid tabular data.

As a user, I want my data to be clean: I want to find and replace specific string(s) with values of my choosing, so that I can build a graph. For example, replace 2017-q2 with 2017-04-01.
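The quarter-to-date replacement from the example above could look like this sketch, which maps each quarter to the first day of its first month (the mapping is an assumption about the desired output, inferred from the 2017-q2 → 2017-04-01 example):

```python
import re

# First month of each quarter, zero-padded.
QUARTER_START = {"q1": "01", "q2": "04", "q3": "07", "q4": "10"}

def quarter_to_date(value):
    """Replace e.g. '2017-q2' with '2017-04-01'; leave other values untouched."""
    match = re.fullmatch(r"(\d{4})-(q[1-4])", value.lower())
    if not match:
        return value
    year, quarter = match.groups()
    return f"{year}-{QUARTER_START[quarter]}-01"

print(quarter_to_date("2017-q2"))  # 2017-04-01
```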

As a user, I have compressed remote data that I want to publish on DataHub without downloading and decompressing it manually.
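Reading a zipped CSV without ever writing a decompressed copy to disk can be sketched with the standard library. Here the archive is built in memory to keep the example self-contained; a real pipeline would fetch the bytes from the remote URL instead:

```python
import csv
import io
import zipfile

# Build a small zip archive in memory (stands in for the remote file).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "country,value\nGE,10\n")

# Stream rows straight out of the archive; nothing touches the filesystem.
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    with zf.open("data.csv") as member:
        reader = csv.DictReader(io.TextIOWrapper(member, encoding="utf-8"))
        rows = list(reader)

print(rows)  # [{'country': 'GE', 'value': '10'}]
```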

Acceptance Criteria

Tasks

major new features needed:

Analysis

I did a bit of research about unzip. It turns out it is supported by tabulator simply by passing the `compression='zip'` option. The problem is that, for some reason, `dpp`, which is supposed to pass parameters on to tabulator, ignores `compression` when the URL does not end with `.zip`. For example, I created two identical zip files and uploaded them here:
* zip file without extension https://datahub.io/zelima/zip-file-without-ext/r/zip-file.withoutext
* zip file with extension: https://datahub.io/zelima/zip-file-with-ext/r/zip-file.zip
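A plausible cause (an assumption on my part, not verified against the `dpp` source) is naive extension sniffing on the URL, which the two URLs above would trip over; an explicitly passed `compression` option should really take precedence over this kind of guess:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def guess_compression(url):
    """Naive sniffing: report 'zip' only when the URL path ends in .zip."""
    suffix = PurePosixPath(urlparse(url).path).suffix
    return "zip" if suffix == ".zip" else None

print(guess_compression(
    "https://datahub.io/zelima/zip-file-with-ext/r/zip-file.zip"))         # zip
print(guess_compression(
    "https://datahub.io/zelima/zip-file-without-ext/r/zip-file.withoutext"))  # None
```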

I then created two identical `pipeline-spec.yaml` files differing only in the URLs. Both of them should either fail or succeed, but for some reason the one with the extension runs fine and the other fails. Take a look at this gist:
https://gist.github.com/zelima/afecc4f0428055cc6ec9a9f3ce7105ea#file-pipeline-spec-yaml

To make sure the problem is in `dpp` and not in `tabulator`, the following gist shows that `tabulator.Stream` works fine with both URLs:
https://gist.github.com/zelima/afecc4f0428055cc6ec9a9f3ce7105ea#file-stream_zip-py

@acckiygerman it seems the zip issue is fixed; at least I could automate a dataset with a zip source. @zelima ?

zelima commented 6 years ago

FIXED. Closing this as most of the job is done! All major features are implemented and live. Automation of the datasets themselves turned out to be much harder to accomplish and needs more time and analysis. Follow up here: #85

AcckiyGerman commented 6 years ago

Datasets that probably could be automated after some effort: