frictionlessdata / datapackage-pipelines

Framework for processing data packages in pipelines of modular components.
https://frictionlessdata.io/
MIT License

Ability to set environment variables for a pipeline #181

Closed roll closed 4 years ago

roll commented 4 years ago

Overview

Recently, I've added support for the TABLESCHEMA_PRESERVE_MISSING_VALUES env var to tableschema-py (https://github.com/frictionlessdata/tableschema-py#experimental), which can be useful for some use cases.
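
A minimal sketch of that behavior, assuming the experimental flag works as documented in the tableschema-py README (the field name here is illustrative):

import os

# Per the experimental docs, setting this flag makes values listed in a
# field's missingValues be preserved as-is instead of being cast to None
os.environ['TABLESCHEMA_PRESERVE_MISSING_VALUES'] = 'true'

from tableschema import Field

field = Field({'name': 'amount', 'type': 'integer'}, missing_values=['-'])
print(field.cast_value('-'))  # '-' preserved; without the flag this would be None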

I propose we have a standard way to declare environment variables for a pipeline, as is implemented in many other declarative formats like docker-compose.yml, .travis.yml, etc.

So we can have something like this:

temporal:
  title: temporal
  description: "temporal format"
  environment:
    DEBUG: True
  pipeline:

  - run: load
    parameters:
      from: 'temporal.csv'
      override_fields:
        date:
          outputFormat: '%m/%d/%Y'

  - run: dump_to_path
    parameters:
      out-path: 'output'
      pretty-descriptor: true
      temporal_format_property: outputFormat
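
As a hypothetical sketch (not actual datapackage-pipelines code) of how a runner could apply such an environment mapping before executing the pipeline, so that processors and any child processes inherit the variables:

import os

def apply_environment(pipeline_spec):
    # os.environ only accepts strings, so stringify YAML scalars like True
    for name, value in (pipeline_spec.get('environment') or {}).items():
        os.environ[name] = str(value)

apply_environment({'environment': {'DEBUG': True}})
assert os.environ['DEBUG'] == 'True'
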
roll commented 4 years ago

@akariv WDYT? Does it make sense?

akariv commented 4 years ago

I have to admit I don't really see the use case here; it sits somewhere in between passing parameters to a processor and using actual environment variables.

Passing common parameters to all processors might be achieved more elegantly by creating 'global parameters' which are then passed to all processors, updating the per-processor parameters (elegance is debatable, of course 😄).

e.g.:

temporal:
  title: temporal
  description: "temporal format"
  parameters:
    debug: True
  pipeline:
  - run: load
    parameters:
      from: 'temporal.csv'
      override_fields:
        date:
          outputFormat: '%m/%d/%Y'

  - run: dump_to_path
    parameters:
      out-path: 'output'
      pretty-descriptor: true
      temporal_format_property: outputFormat
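
A hypothetical sketch of the merge semantics (the function name is illustrative, not an actual datapackage-pipelines API): the globals would seed each step's parameters, with per-step values taking precedence:

def merge_parameters(global_params, step_params):
    # Start from the globals and let the step's own parameters win
    merged = dict(global_params or {})
    merged.update(step_params or {})
    return merged

step = {'run': 'load', 'parameters': {'from': 'temporal.csv'}}
print(merge_parameters({'debug': True}, step['parameters']))
# -> {'debug': True, 'from': 'temporal.csv'}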

Is there any other use case here besides controlling the FD libraries' behavior? (I'm honestly asking.)

roll commented 4 years ago

I don't know of other use cases, but I think this feature can still be general if we think of something like providing env vars to an underlying AWS library, requests, etc.
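
For example (a hypothetical illustration; the proxy URL is made up), requests already resolves proxies from the environment, so a declared env var would reach it with no extra plumbing:

import os
import requests.utils

os.environ['http_proxy'] = 'http://proxy.example.com:8080'  # made-up value

# requests consults the environment when resolving proxies for a URL
print(requests.utils.get_environ_proxies('http://example.com'))
# -> {'http': 'http://proxy.example.com:8080'} (plus any other proxy vars set)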

If there are other ways to make it work, that should be fine for BCO-DMO. If a custom processor could set env vars, that would be enough, but I guess it's not possible from a processor.

The main goal of this proposal is to make the output of the DPP UI (which BCO-DMO is working on) reproducible on the CLI. Inside their service they can set env vars themselves, but the outputted DPP specs are going to be run in uncontrolled environments.

roll commented 4 years ago

Hi @akariv,

Sorry, I didn't completely understand. Are you against this change? Could you please elaborate?

In general, I see this as fairly logical, because environment variable management is available in many similar specs like Travis, Docker Compose, etc.

akariv commented 4 years ago

Hey @roll - on second thought, I'm okay with this proposal.

roll commented 4 years ago

Cool @akariv. Are you happy with this PR - https://github.com/frictionlessdata/datapackage-pipelines/pull/182?

roll commented 4 years ago

DONE (ready to merge in #182)

cschloer commented 3 years ago

Can a release be created for this update? Thanks.

roll commented 3 years ago

Hi @akariv, could you please make a release?

roll commented 3 years ago

Thanks @akariv!

@cschloer https://pypi.org/project/datapackage-pipelines/