With the introduction of the new pipeline subcommand of the Elyra CLI, a zero-UI workflow has become possible. However, the current pipeline format is very verbose and can be difficult to view or edit by hand. I think it would benefit users to introduce a new experimental pipeline format that is compatible with the UI but focuses much more on the user story of creating a pipeline from scratch in nothing but a simple text editor. I think this could open up Elyra to users who have difficulty with our drag-and-drop UI, as well as users who would like to use an IDE that doesn't have PipelineEditor support yet.
I think that a good case study for this proposal would be the covid-notebooks, as the pipelines are moderately complex.
Here are the full contents of the us_data pipeline:
redacted because it was way too long
The file weighs in at 596 LOC, and much of that information is redundant or ignored by the pipeline-editor component.
Arguably, the most important part of the pipeline is the nodes array. Here is an example of a single node. It weighs in at 64 LOC:
Breakdown:
✅ id is important
❌ type seems a bit redundant to me. Only execution_node and super_node are acknowledged by pipeline-editor, and they can be easily distinguished without type
✅ op is important
app_data:
  ✅ properties is important
ui_data:
  ❌ label is dynamically generated at runtime
  ❌ image is dynamically generated at runtime
  ❌ description is dynamically generated at runtime
  💅 x_pos is unneeded if only a text editor is being used, but is needed to track position in the UI
  💅 y_pos is unneeded if only a text editor is being used, but is needed to track position in the UI
✅ inputs: the only thing we really need here is node_id_ref, the rest can be generated
❌ outputs is never actually specified, the information can be generated
With this in mind, I think we can reduce a node down to a simple YAML definition:
clean-data: # the node id
  op: execute-notebook-node
  needs: etl-data # link information
  properties:
    file: ../notebooks/clean_us_data.ipynb
    image: continuumio/anaconda3:2020.07
    dependencies:
      - inputs/co-est2019-alldata.csv
      - util.py
    outputs:
      - outputs/us_counties_clean.csv
      - outputs/us_counties_clean_meta.json
      - outputs/us_counties_clean.feather
      - outputs/dates.feather
  # x_pos, y_pos should be specified elsewhere to keep the node definition clean of any ui-only data
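To make the "the rest can be generated" claim concrete, here is a rough sketch of how a compact node like the one above could be expanded back into a verbose node. It works on a plain dict that mirrors the YAML, so no YAML library is needed. The function name `expand_node`, the `inPort`/`outPort` ids, and the exact nesting of the output are my assumptions for illustration; they are not taken from the real pipeline schema.

```python
import json


def expand_node(node_id, compact):
    """Expand a compact node definition into a verbose, JSON-style node.

    Hypothetical helper: the output layout only approximates the verbose
    format discussed above and is an assumption, not the real schema.
    """
    # `needs` may be a single node id or a list of node ids.
    needs = compact.get("needs", [])
    if isinstance(needs, str):
        needs = [needs]

    return {
        "id": node_id,
        # Assumption: every node in the compact format is an execution node.
        "type": "execution_node",
        "op": compact["op"],
        "app_data": {
            "properties": dict(compact.get("properties", {})),
            # label, image and description are generated at runtime, and
            # x_pos/y_pos live elsewhere, so ui_data starts out empty.
            "ui_data": {},
        },
        # Only node_id_ref carries information; the port ids are placeholders.
        "inputs": [
            {"id": "inPort", "links": [{"node_id_ref": ref} for ref in needs]}
        ],
        "outputs": [{"id": "outPort"}],
    }


if __name__ == "__main__":
    compact = {
        "op": "execute-notebook-node",
        "needs": "etl-data",
        "properties": {
            "file": "../notebooks/clean_us_data.ipynb",
            "image": "continuumio/anaconda3:2020.07",
        },
    }
    print(json.dumps(expand_node("clean-data", compact), indent=2))
```

The point is only that everything marked ❌ in the breakdown can be filled in mechanically, so the text format would not need to carry it.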
If we repeat this for the rest of the nodes and add the concept of defaults, we could end up with something like this:
If you drop the auto-generated UI hints, comments, and newlines, this is only 67 LOC (an entire pipeline in about the size of a single node's JSON).
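As a rough illustration of the defaults idea, here is one way a loader could resolve them: each node inherits any property it does not set itself. This is only a sketch. The function name `apply_defaults`, the choice to apply defaults to `properties` only, and the `etl_us_data.ipynb` and `python:3.8` values are my assumptions, not part of the proposal.

```python
from copy import deepcopy


def apply_defaults(defaults, nodes):
    """Return new node definitions with missing properties filled from defaults.

    Hypothetical helper: assumes defaults apply to `properties` only and that
    a value set on the node always wins over the default.
    """
    resolved = {}
    for node_id, node in nodes.items():
        merged = deepcopy(node)
        merged["properties"] = {
            **deepcopy(defaults),
            **deepcopy(node.get("properties", {})),
        }
        resolved[node_id] = merged
    return resolved


if __name__ == "__main__":
    defaults = {"image": "continuumio/anaconda3:2020.07"}
    nodes = {
        # Overrides the default image (file name and image are made up).
        "etl-data": {
            "op": "execute-notebook-node",
            "properties": {
                "file": "../notebooks/etl_us_data.ipynb",
                "image": "python:3.8",
            },
        },
        # Inherits the default image.
        "clean-data": {
            "op": "execute-notebook-node",
            "needs": "etl-data",
            "properties": {"file": "../notebooks/clean_us_data.ipynb"},
        },
    }
    for node_id, node in apply_defaults(defaults, nodes).items():
        print(node_id, node["properties"]["image"])
```

Resolving defaults up front like this would keep the rest of the tooling unaware of them, since every node ends up fully specified.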