datopian / planner

Plan processing based on spec
MIT License

Improve readme #16

Closed AcckiyGerman closed 6 years ago

AcckiyGerman commented 6 years ago

As a Python developer I want a clearer readme and definitions, so that I can use the 'planner' in my work.

Questions I'd like to clarify for myself and future developers:

Examples

We have a flow.yaml example but no example of how it is used, how to run it, etc. Maybe you should provide a link to another readme where this question is covered.

The whole process picture

I've read the readme three times but still have no clue about the whole flow:
"...flows are .. pipelines and generates .. artifacts." - then, in another part of the docs - "... each artifact is converted into a pipeline"

I have a feeling of going in circles, and that I have missed a big part of the whole pipelines picture:

datapackage --> flow created --> pipeline --> artifact --> pipeline --> assembly pipeline --> datapackage

datapackage --> pipelines
   ^^^
who runs all this  >>> S3 storage
   ^^^
amazon instance

If these diagrams are described at other levels (in other repos), please provide links to them.

AcckiyGerman commented 6 years ago

flow.yaml example:

meta:
  dataset: <dataset_name>
  findability: public
  # should match username and id from `cat ~/.config/datahub/config.json`
  owner: core
  ownerid: core

inputs:
  -
    kind: datapackage  # currently only datapackages are supported
    parameters:
      resource-mapping:
        # the link to the original data-source file
        <resource-name>: <http://source-site.com/datafile.csv>

      # with the latest changes we don't need the descriptor section
      # any more - it is taken from `.datahub/datapackage.json`;
      # if there is no descriptor in `dp.json`, the pipeline will
      # infer it automatically from the source file structure.

      # So you can now delete the 'descriptor' section!
      # But you could also leave it and add some intermediate
      # resources, if you need to use them in the pipeline later
      # (see the processing section below).
      descriptor:
        # this name will be used in the next steps
        name: <resource-name>
        title: Title
        homepage: <http://source-site.com/>
        version: 0.0.1
        license: <MIT, GPL, etc?>
        # This section should match the source data structure.
        # Use the datapackage-py infer method to get this in JSON format
        # (see the Python sketch after this example).
        # You can also copy it from the `dataset/datapackage.json` file,
        # but check the schema twice - the original processing script
        # usually changes the schema.
        resources:
          -
            # the original data source will probably be written to this file
            name: <resource-name> # not sure we need this
            path: <data/resource-name.csv>
            format: csv
            mediatype: text/csv
            "schema":
              "fields":
                -
                  "name": "id"
                  "type": "integer"
                -
                  "name": "type"
                  "type": "string"
                # etc

# the PROCESSING part describes what to do with the data, how to 'process' it.
# Processors are the programs that will wrangle your data; see:
# https://github.com/frictionlessdata/datapackage-pipelines - dpp
# https://github.com/frictionlessdata/tabulator-py - tabulator

# Each processor takes data from "input: <resource-name>",
# performs the operations that you define in this section,
# and saves the data into "output: <resource-name>";
# then the next processor takes the data from there, and so on
# (for now we use the same name for input and output).
processing:
  - # put this tabulator processor first in a pipeline if the source is zipped
    input: <resource-name>
    tabulator:
      compression: zip
    output: <resource-name>

  # Datapackage-pipelines operations example. The dpp docs are here:
  # https://github.com/frictionlessdata/datapackage-pipelines
  - 
    input: <resource-name>
    dpp:
      - # delete some columns:
        run: delete_fields
        parameters:
          resources: <resource-name>
          fields:
            - id
            - home_link
            - keywords
      - # unpivot table
        run: unpivot
        parameters:
          resources: <resource-name>
          extraKeyFields:
            -
              name: year
              type: year
          extraValueField:
              name: value
              type: number
          unpivot:
            -
              name: ([0-9]{4})
              keys:
                year: \1
      - # replace, e.g. quarters with dates: '1998 Q1' -> 1998-03-31, Q2 -> 06-30, etc.
        # (see the plain-Python check after this example)
        run: find_replace
        parameters:
          resources: <resource-name>
          fields:
            -
              name: date
              patterns:
                -
                  find: ([0-9]{4})( Q1)
                  replace: \1-03-31
                -
                  find: ([0-9]{4})( Q2)
                  replace: \1-06-30
                -
                  find: ([0-9]{4})( Q3)
                  replace: \1-09-30
                -
                  find: ([0-9]{4})( Q4)
                  replace: \1-12-31
    output: <resource-name>

# how often to run the automation?
schedule: every 1d
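
For the descriptor section above ("Use the datapackage-py infer method"), here is a minimal sketch of how the schema could be generated, assuming the source file has already been downloaded to data/resource-name.csv (the path and resource name are placeholders, and this uses datapackage-py directly, not the planner itself):

import json

from datapackage import Package

# build an empty data package and let datapackage-py inspect the local file;
# infer() fills in the resources list, including field names and types
package = Package()
package.infer('data/resource-name.csv')

# print the inferred descriptor as JSON and copy the relevant parts
# (resources / schema) into the descriptor section of flow.yaml
print(json.dumps(package.descriptor, indent=2))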
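
And a quick plain-Python check of what the find_replace patterns in the processing section do (this only mirrors the regexes with re.sub to make the quarter-to-date mapping explicit; it is not how dpp actually runs the step):

import re

# the same find/replace pairs as in the find_replace step above:
# 'YYYY Qn' -> the last day of that quarter
PATTERNS = [
    (r'([0-9]{4})( Q1)', r'\1-03-31'),
    (r'([0-9]{4})( Q2)', r'\1-06-30'),
    (r'([0-9]{4})( Q3)', r'\1-09-30'),
    (r'([0-9]{4})( Q4)', r'\1-12-31'),
]

def quarter_to_date(value):
    for find, replace in PATTERNS:
        value = re.sub(find, replace, value)
    return value

assert quarter_to_date('1998 Q1') == '1998-03-31'
assert quarter_to_date('2005 Q2') == '2005-06-30'
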
AcckiyGerman commented 6 years ago

@zelima probably we could add the whole Automation guide (it is ready now) here in the planner readme, what do you think? I want to save my work here, since I put some effort into it.

zelima commented 6 years ago

@AcckiyGerman we can have flow.md or something like that and link it from the README. All that automation guide is more related to push and the CLI, or the frontend tutorial section, than to here, I think.

zelima commented 6 years ago

WONTFIX: we are not going to implement automation in this way.