datahubio / datahub-v2-pm

Project management (issues only)

[Epic] Dataset automation #78

Closed zelima closed 6 years ago

zelima commented 6 years ago

Dataset automation - Dec 2017

As a user, I want to un-pivot and normalize my remote data, so that I can package it easily and possibly create graphs from it.
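Un-pivoting (melting) wide data into long, normalized rows could be sketched in pure Python like this; the `country`/year column names are hypothetical and just illustrate the shape of the transformation:

```python
def unpivot(rows, id_keys, var_name="year", value_name="value"):
    """Turn wide rows into long (normalized) rows."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key in id_keys:
                continue  # identifier columns are copied, not melted
            record = {k: row[k] for k in id_keys}
            record[var_name] = key
            record[value_name] = value
            long_rows.append(record)
    return long_rows

wide = [{"country": "GE", "2016": 10, "2017": 12}]
print(unpivot(wide, id_keys=["country"]))
# [{'country': 'GE', 'year': '2016', 'value': 10},
#  {'country': 'GE', 'year': '2017', 'value': 12}]
```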

As a user, I want to remove columns that I do not want to be in DataHub, so that I can present only the data relevant to my needs.

As a user, I want to add a new column that is derived from two or more other columns, so that I can calculate e.g. the total amount of money spent per country.
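Both of the previous two stories (dropping an unwanted column, adding a derived one) reduce to a simple row transform; a minimal sketch with made-up column names:

```python
rows = [
    {"country": "GE", "spent_q1": 5, "spent_q2": 7, "internal_id": 42},
]

cleaned = []
for row in rows:
    # Drop the column we do not want published.
    new_row = {k: v for k, v in row.items() if k != "internal_id"}
    # Add a column derived from two existing columns.
    new_row["total_spent"] = new_row["spent_q1"] + new_row["spent_q2"]
    cleaned.append(new_row)

print(cleaned[0]["total_spent"])  # 12
```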

As a user, I want to remove the last row from an Excel file, so that it is valid tabular data.

As a user, I want my data to be clean: I want to find and replace specific string(s) with values of my choosing, so that I can build a graph. For example, replace 2017-q2 with 2017-04-01.
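The quarter-to-date replacement from the example above could look like this sketch, which maps each quarter to the first day of its first month (the mapping is an assumption about the desired output, inferred from the 2017-q2 → 2017-04-01 example):

```python
import re

# First month of each quarter, zero-padded.
QUARTER_START = {"q1": "01", "q2": "04", "q3": "07", "q4": "10"}

def quarter_to_date(value):
    """Replace e.g. '2017-q2' with '2017-04-01'; leave other values untouched."""
    match = re.fullmatch(r"(\d{4})-(q[1-4])", value.lower())
    if not match:
        return value
    year, quarter = match.groups()
    return f"{year}-{QUARTER_START[quarter]}-01"

print(quarter_to_date("2017-q2"))  # 2017-04-01
```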

As a user, I have compressed remote data that I want to publish on DataHub without downloading and decompressing it manually.
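Reading a zipped CSV without ever writing a decompressed copy to disk can be sketched with the standard library. Here the archive is built in memory to keep the example self-contained; a real pipeline would fetch the bytes from the remote URL instead:

```python
import csv
import io
import zipfile

# Build a small zip archive in memory (stands in for the remote file).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "country,value\nGE,10\n")

# Stream rows straight out of the archive; nothing touches the filesystem.
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    with zf.open("data.csv") as member:
        reader = csv.DictReader(io.TextIOWrapper(member, encoding="utf-8"))
        rows = list(reader)

print(rows)  # [{'country': 'GE', 'value': '10'}]
```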

Acceptance Criteria

Tasks

major new features needed:

Analysis

I did a bit of research about unzip. It turns out it is supported by tabulator simply by passing the `compression='zip'` option. The problem is that, for some reason, `dpp`, which is supposed to pass parameters on to tabulator, ignores `compression` when the URL does not end with `.zip`. For example, I created two identical zip files and uploaded them here:
* zip file without extension https://datahub.io/zelima/zip-file-without-ext/r/zip-file.withoutext
* zip file with extension: https://datahub.io/zelima/zip-file-with-ext/r/zip-file.zip
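A plausible cause (an assumption on my part, not verified against the `dpp` source) is naive extension sniffing on the URL, which the two URLs above would trip over; an explicitly passed `compression` option should really take precedence over this kind of guess:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def guess_compression(url):
    """Naive sniffing: report 'zip' only when the URL path ends in .zip."""
    suffix = PurePosixPath(urlparse(url).path).suffix
    return "zip" if suffix == ".zip" else None

print(guess_compression(
    "https://datahub.io/zelima/zip-file-with-ext/r/zip-file.zip"))         # zip
print(guess_compression(
    "https://datahub.io/zelima/zip-file-without-ext/r/zip-file.withoutext"))  # None
```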

I then created two identical `pipeline-spec.yaml` files differing only in the URLs. Both of them should either fail or succeed, but for some reason the one with the extension runs fine and the other fails. Take a look at this gist:
https://gist.github.com/zelima/afecc4f0428055cc6ec9a9f3ce7105ea#file-pipeline-spec-yaml

To make sure the problem is in `dpp` and not in `tabulator`, the following gist shows that `tabulator.Stream` works fine with both URLs:
https://gist.github.com/zelima/afecc4f0428055cc6ec9a9f3ce7105ea#file-stream_zip-py

@acckiygerman it seems the zip issue is fixed; at least I could automate a dataset with a zip source. @zelima ?

zelima commented 6 years ago

FIXED. Closing this as most of the job is done! All major features are implemented and live. Automation of the datasets themselves turned out to be much harder to accomplish and needs more time and analysis. Follow up here: #85

AcckiyGerman commented 6 years ago

Datasets that probably could be automated after some effort: