datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Suggested feature: Dataflows DSL #49

Closed · OriHoch closed this issue 5 years ago

OriHoch commented 5 years ago
$ dataflows -c '
load "/foo/bar/datapackage.json"
./my-flow.py:my_step "arg_a" "arg_b"
printer
'
FOO | BAR
----|----
aaa | ccc
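
For reference, the same flow can already be expressed with the current Python API; the DSL would essentially be sugar over something like the following (my_step is a hypothetical custom processor, written here as a plain row-processing function):

# Rough Python-API equivalent of the DSL snippet above (sketch only)
from dataflows import Flow, load, printer

def my_step(arg_a, arg_b):
    # hypothetical custom step from ./my-flow.py, parameterized by two strings
    def step(row):
        # illustrative row mutation; a real step would do something useful here
        row['foo'] = arg_a
        row['bar'] = arg_b
    return step

Flow(
    load('/foo/bar/datapackage.json'),
    my_step('arg_a', 'arg_b'),
    printer(),
).process()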

Create file: my-flow.dataflow

#!/usr/bin/env dataflows

my_module.steps:my_step "${1}" "${2}" '{"baz":"bax"}'
checkpoint

Run it:

$ chmod +x my-flow.dataflow
$ ./my-flow.dataflow "PARAM_1" "PARAM_2"

Saving checkpoint 1
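
In terms of the current Python API, the executable file above would boil down to roughly this (reading the positional parameters from sys.argv and the explicit checkpoint name are both assumptions, since the DSL appears to auto-number checkpoints):

# Sketch of what ./my-flow.dataflow could translate to in plain Python
import sys
from dataflows import Flow, checkpoint
from my_module.steps import my_step  # hypothetical module providing the custom step

param_1, param_2 = sys.argv[1], sys.argv[2]

Flow(
    my_step(param_1, param_2, {"baz": "bax"}),
    checkpoint('1'),  # name assumed; saved so a later flow can resume from it
).process()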

$ dataflows -c '
checkpoint 1
join --source_name=foo --source_key=["my_id"] \
       --source_delete=false --target_name=bar --target_key=["my_id"] \
       --fields={"baz": {}}
checkpoint
'
Loading from checkpoint 1
Saving to checkpoint 2
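
The join step maps directly onto the existing join processor, so the snippet above corresponds roughly to the following (checkpoint names assumed, as before):

# Rough Python equivalent of the checkpoint + join snippet (sketch only)
from dataflows import Flow, checkpoint, join

Flow(
    checkpoint('1'),  # loads the checkpoint saved by the previous run
    join(
        source_name='foo',
        source_key=['my_id'],
        source_delete=False,
        target_name='bar',
        target_key=['my_id'],
        fields={'baz': {}},  # bring the 'baz' field over from the source resource
    ),
    checkpoint('2'),
).process()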

Related: #48

OriHoch commented 5 years ago

this would be very useful in DevOps / automation, e.g.:

$ pip install dataflows-kubernetes
$ dataflows -c '
kubernetes.get:pods --label=ckan --all-namespaces
filter_rows --not_equals={"Phase": "Running"}
kubernetes.delete:pods --all-namespaces
printer --fields=["pod_name", "Phase", "is_deleted"]
'
pod_name            | Phase   | is_deleted
--------------------|---------|-----------
foobar-9fjh23-2j3j3 | Pending | yes
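
A dataflows-kubernetes package doesn't exist yet, but the same pipeline could already be composed with the Python API. A rough sketch, with the kubernetes steps stubbed out (get_pods and delete_pods are hypothetical stand-ins; only filter_rows and printer are real dataflows processors):

# Hypothetical sketch of the DevOps example using the Python API
from dataflows import Flow, filter_rows, printer

def get_pods(label, all_namespaces=False):
    # stub producer; a real processor would list pods via the Kubernetes API
    def rows():
        yield {'pod_name': 'foobar-9fjh23-2j3j3', 'Phase': 'Pending', 'is_deleted': 'no'}
    return rows()

def delete_pods(all_namespaces=False):
    # stub row processor; a real processor would delete each pod it receives
    def step(row):
        row['is_deleted'] = 'yes'
    return step

Flow(
    get_pods(label='ckan', all_namespaces=True),
    filter_rows(not_equals=[{'Phase': 'Running'}]),  # existing dataflows processor
    delete_pods(all_namespaces=True),
    printer(),
).process()
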
OriHoch commented 5 years ago

Implementation option

OriHoch commented 5 years ago

implemented here

rufuspollock commented 5 years ago

@OriHoch this is awesome :smile: 🥇