datopian / planner

Plan processing based on spec
MIT License

Improve readme #16

Closed AcckiyGerman closed 6 years ago

AcckiyGerman commented 6 years ago

As a Python developer I want a clearer readme and definitions, so that I can use the 'planner' in my work.

Questions I'd like to clarify for myself and future developers:

Examples

We have a flow.yaml example but no example of how it is used, how to run it, etc. Maybe you should provide a link to another readme where this question is covered.

The whole process picture

I've read the readme three times but still have no clue about the whole flow:
"...flows are .. pipelines and generates .. artifacts." - then, in another part of the docs - "... each artifact is converted into a pipeline"

I have a feeling of going in circles, and that I have missed a big part of the whole pipelines picture:

datapackage --> flow created --> pipeline --> artifact --> pipeline --> assembly pipeline --> datapackage

datapackage --> pipelines
   ^^^
who runs all this  >>> S3 storage
   ^^^
amazon instance

If these diagrams are described at other levels (in other repos), please provide links to them.

AcckiyGerman commented 6 years ago

flow.yaml example:

meta:
  dataset: <dataset_name>
  findability: public
  # should match username and id from `cat ~/.config/datahub/config.json`
  owner: core
  ownerid: core

inputs:
  -
    kind: datapackage  # currently only datapackages are supported
    parameters:
      resource-mapping:
        # the link to the original data-source file
        <resource-name>: <http://source-site.com/datafile.csv>

      # with the latest changes we don't need the descriptor section
      # any more - it is taken from `.datahub/datapackage.json`;
      # if there is no descriptor in `dp.json`, the pipeline will
      # infer it automatically from the source file structure.

      # So you can now delete the 'descriptor' section!
      # But you could also leave it and add some intermediate
      # resources, if you need to use them in the pipeline later
      # (see the processing section below).
      descriptor:
        # this name will be used in the next steps
        name: <resource-name>
        title: Title
        homepage: <http://source-site.com/>
        version: 0.0.1
        license: <MIT, GPL, etc?>
        # This section should match the source data structure.
        # Use the datapackage-py infer method to get this in JSON format
        # (see the Python sketch after this example).
        # You can also copy it from the `dataset/datapackage.json` file,
        # but check the schema twice - the original processing script
        # usually changes the schema.
        resources:
          -
            # the original data source will probably be written to this file
            name: <resource-name> # not sure we need this
            path: <data/resource-name.csv>
            format: csv
            mediatype: text/csv
            "schema":
              "fields":
                -
                  "name": "id"
                  "type": "integer"
                -
                  "name": "type"
                  "type": "string"
                # etc

# the PROCESSING part describes what to do with the data, how to 'process' it.
# Processors are the programs that will wrangle your data; see:
# https://github.com/frictionlessdata/datapackage-pipelines - dpp
# https://github.com/frictionlessdata/tabulator-py - tabulator

# Each processor takes data from "input: <resource-name>",
# performs the operations that you define in this section,
# and saves the data into "output: <resource-name>";
# then the next processor takes the data from there, and so on
# (for now we use the same name for input and output).
processing:
  - # put this tabulator processor first in a pipeline if the source is zipped
    input: <resource-name>
    tabulator:
      compression: zip
    output: <resource-name>

  # Datapackage-pipelines operations example. The dpp docs are here:
  # https://github.com/frictionlessdata/datapackage-pipelines
  - 
    input: <resource-name>
    dpp:
      - # delete some columns:
        run: delete_fields
        parameters:
          resources: <resource-name>
          fields:
            - id
            - home_link
            - keywords
      - # unpivot table
        run: unpivot
        parameters:
          resources: <resource-name>
          extraKeyFields:
            -
              name: year
              type: year
          extraValueField:
              name: value
              type: number
          unpivot:
            -
              name: ([0-9]{4})
              keys:
                year: \1
      - # replace, e.g. quarters with dates: '1998 Q1' -> 1998-03-31, Q2 -> 06-30, etc.
        # (see the plain-Python check after this example)
        run: find_replace
        parameters:
          resources: <resource-name>
          fields:
            -
              name: date
              patterns:
                -
                  find: ([0-9]{4})( Q1)
                  replace: \1-03-31
                -
                  find: ([0-9]{4})( Q2)
                  replace: \1-06-30
                -
                  find: ([0-9]{4})( Q3)
                  replace: \1-09-30
                -
                  find: ([0-9]{4})( Q4)
                  replace: \1-12-31
    output: <resource-name>

# how often to run the automation?
schedule: every 1d
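
For the descriptor section above ("Use the datapackage-py infer method"), here is a minimal sketch of how the schema could be generated, assuming the source file has already been downloaded to data/resource-name.csv (the path and resource name are placeholders, and this uses datapackage-py directly, not the planner itself):

import json

from datapackage import Package

# build an empty data package and let datapackage-py inspect the local file;
# infer() fills in the resources list, including field names and types
package = Package()
package.infer('data/resource-name.csv')

# print the inferred descriptor as JSON and copy the relevant parts
# (resources / schema) into the descriptor section of flow.yaml
print(json.dumps(package.descriptor, indent=2))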
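
And a quick plain-Python check of what the find_replace patterns in the processing section do (this only mirrors the regexes with re.sub to make the quarter-to-date mapping explicit; it is not how dpp actually runs the step):

import re

# the same find/replace pairs as in the find_replace step above:
# 'YYYY Qn' -> the last day of that quarter
PATTERNS = [
    (r'([0-9]{4})( Q1)', r'\1-03-31'),
    (r'([0-9]{4})( Q2)', r'\1-06-30'),
    (r'([0-9]{4})( Q3)', r'\1-09-30'),
    (r'([0-9]{4})( Q4)', r'\1-12-31'),
]

def quarter_to_date(value):
    for find, replace in PATTERNS:
        value = re.sub(find, replace, value)
    return value

assert quarter_to_date('1998 Q1') == '1998-03-31'
assert quarter_to_date('2005 Q2') == '2005-06-30'
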
AcckiyGerman commented 6 years ago

@zelima probably we could add the whole Automation guide (it is ready now) here in the planner readme, what do you think? I want to save my work here, since I put some effort into it.

zelima commented 6 years ago

@AcckiyGerman we can have flow.md or something like that and link it from the README. All that automation guide is more related to push and the CLI, or the frontend tutorial section, than to here, I think.

zelima commented 6 years ago

WONTFIX: we are not going to implement automation in this way.