Closed: AcckiyGerman closed this issue 6 years ago.
flow.yaml example:
```yaml
meta:
  dataset: <dataset_name>
  findability: public
  # should match the username and id from `cat ~/.config/datahub/config.json`
  owner: core
  ownerid: core

inputs:
  -
    kind: datapackage  # currently only datapackages are supported
    parameters:
      resource-mapping:
        # the link to the original data-source file
        <resource-name>: <http://source-site.com/datafile.csv>
      # With the latest changes we no longer need the descriptor section -
      # it is taken from `.datahub/datapackage.json`. If there is no
      # descriptor in `dp.json`, the pipeline will infer it automatically
      # from the source file structure, so you can delete the 'descriptor'
      # section. You could also leave it and add some intermediate
      # resources, if you need to use them in the pipeline later
      # (see the processing section below).
      descriptor:
        # this name will be used in the next steps
        name: <resource-name>
        title: Title
        homepage: <http://source-site.com/>
        version: 0.0.1
        license: <MIT, GPL, etc.>
        # This section should match the source data structure.
        # Use the datapackage-py infer method to get it in JSON format
        # (see the datapackage-py sketch below this example).
        # You can also copy it from the `dataset/datapackage.json` file,
        # but double-check the schema - the original processing script
        # usually changed it.
        resources:
          -
            # the original data source will probably be written to this file
            name: <resource-name>  # not sure we need this
            path: <data/resource-name.csv>
            format: csv
            mediatype: text/csv
            schema:
              fields:
                -
                  name: id
                  type: integer
                -
                  name: type
                  type: string
                # etc.

# The PROCESSING part describes what to do with the data - how to 'process' it.
# Processors are the programs that will wrangle your data; see:
# https://github.com/frictionlessdata/datapackage-pipelines - dpp
# https://github.com/frictionlessdata/tabulator-py - tabulator
# Each processor takes data from "input: <resource-name>", performs the
# operations that you define in this section and saves the data into
# "output: <resource-name>"; the next processor then takes the data from
# there, and so on (for now we use the same name for input and output).
processing:
  # put this tabulator processor first in the pipeline if the source is zipped
  # (see the tabulator sketch below this example)
  -
    input: <resource-name>
    tabulator:
      compression: zip
    output: <resource-name>
  # Example of datapackage-pipelines operations. The dpp docs are here:
  # https://github.com/frictionlessdata/datapackage-pipelines
  -
    input: <resource-name>
    dpp:
      -
        # delete some columns
        run: delete_fields
        parameters:
          resources: <resource-name>
          fields:
            - id
            - home_link
            - keywords
      -
        # unpivot the table
        run: unpivot
        parameters:
          resources: <resource-name>
          extraKeyFields:
            -
              name: year
              type: year
          extraValueField:
            name: value
            type: number
          unpivot:
            -
              name: ([0-9]{4})
              keys:
                year: \1
      -
        # replace, e.g. quarters with dates: '1998 Q1' -> 1998-03-31, Q2 -> 06-30, etc.
        # (see the regex check below this example)
        run: find_replace
        parameters:
          resources: <resource-name>
          fields:
            -
              name: date
              patterns:
                -
                  find: ([0-9]{4})( Q1)
                  replace: \1-03-31
                -
                  find: ([0-9]{4})( Q2)
                  replace: \1-06-30
                -
                  find: ([0-9]{4})( Q3)
                  replace: \1-09-30
                -
                  find: ([0-9]{4})( Q4)
                  replace: \1-12-31
    output: <resource-name>

# how often to run the automation?
schedule: every 1d
```
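
The descriptor comment above points to the datapackage-py infer method. A minimal sketch of that step, assuming the source file has already been downloaded to `data/resource-name.csv` (an illustrative path, not part of the spec):

```python
# Minimal sketch: infer a descriptor with datapackage-py and print it as JSON,
# so the "resources" / "schema" section of flow.yaml can be copied from the output.
# Assumes `pip install datapackage` and a local file data/resource-name.csv (illustrative path).
import json
from datapackage import Package

package = Package()
package.infer('data/resource-name.csv')  # scans the file and builds resource + schema metadata
print(json.dumps(package.descriptor, indent=2))
```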
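
The first processing step relies on the tabulator processor to unpack a zipped source. A rough equivalent using tabulator-py directly - the URL is a placeholder, and the `compression` option is assumed to be available in the installed tabulator version:

```python
# Rough sketch of reading a zipped CSV with tabulator-py, mirroring the
# `tabulator: compression: zip` step above. The URL is a placeholder and the
# compression option is assumed to be supported by the installed tabulator version.
from tabulator import Stream

with Stream('http://source-site.com/datafile.zip',
            format='csv', compression='zip', headers=1) as stream:
    for row in stream.iter():
        print(row)  # each row is a list of cell values
```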
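
The find_replace patterns are ordinary regular expressions with backreferences; here is a quick sanity check of the quarter-to-date substitution using Python's standard `re` module (note that Q2 ends on June 30, which is why the example above uses `-06-30`):

```python
# Quick check of the quarter-to-date patterns used in the find_replace step,
# using re.sub with the same regex and backreference syntax.
import re

quarter_ends = {' Q1': '-03-31', ' Q2': '-06-30', ' Q3': '-09-30', ' Q4': '-12-31'}

def quarter_to_date(value):
    for suffix, end in quarter_ends.items():
        value = re.sub(r'([0-9]{4})(' + suffix + ')', r'\1' + end, value)
    return value

print(quarter_to_date('1998 Q1'))  # -> 1998-03-31
print(quarter_to_date('1998 Q2'))  # -> 1998-06-30
```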
@zelima probably we could add the whole Automation guide (it is ready now) here in the planner README - what do you think? I want to save my work here, because I put some effort into it.
@AcckiyGerman we can have a flow.md or something like that and link it from the README. That whole automation guide is more related to push and the CLI, or to the frontend tutorial section, than to this repo, I think.
WONTFIX we are not going to implement automation in this way.
As a Python developer I want a clearer README and definitions so that I can use the 'planner' in my work.
Questions I'd like to clarify for myself and for future developers:
[ ] General package description - what is the package for, where is it used, which bigger structure is it part of, and what role does it play there?
[ ] Why create a 'flow' entity at all? How do a flow and a pipeline differ?
[ ] Should we create an instance of this class to create an artifact? It is not very clear from this sentence.
[ ] There is no artifact definition along the lines of
"The processing artifact is a (data structure? info holder?) that stores (data about the dataset and links to dependencies (also artifacts))",
etc., so I still have no idea what the artifact is - only what it holds.

**Examples**

We have a flow.yaml example, but no example of how it is used, how to run it, etc. Maybe you should provide a link to another README where this question is covered.

**The whole process picture**
I've read the README three times but still have no clue about the whole flow:
"...flows are ... pipelines and generates ... artifacts" - and then, in another part of the docs - "...each artifact is converted into a pipeline".
I have a feeling that I'm going in circles and that I've missed a big part of the whole pipelines picture:
datapackage --> flow created --> pipeline --> artifact --> pipeline --> assembly pipeline --> datapackage
If these diagrams are described at other levels (in other repos), please provide links to them.