[RFC] Proposal for an experimental pipeline format

With the introduction of the new pipeline subcommand of the Elyra CLI, a zero-UI workflow has become possible. However, the current pipeline format is very verbose and can be difficult to read and edit by hand. I think it would benefit users to introduce a new experimental pipeline format that stays compatible with the UI but focuses on the user story of creating a pipeline from scratch using only a simple text editor. This could open up Elyra to users who have difficulty with our drag-and-drop UI, as well as users who would like to use an IDE that doesn't have PipelineEditor support yet.

A good case study for this proposal is covid-notebooks, since its pipelines are moderately complex.

Here is the full contents of the us_data pipeline:

redacted because it was way too long

The full pipeline weighs in at 596 LOC, and much of its information is either redundant or ignored by the pipeline-editor component.

Arguably, the most important part of the pipeline is the nodes array. Here is an example of a single node, which weighs in at 64 LOC:

{
  "id": "130d1e75-9585-480b-8c98-c86aaf051159",
  "type": "execution_node",
  "op": "execute-notebook-node",
  "app_data": {
    "filename": "../notebooks/clean_us_data.ipynb",
    "runtime_image": "continuumio/anaconda3:2020.07",
    "dependencies": [
      "util.py",
      "inputs/co-est2019-alldata.csv"
    ],
    "include_subdirectories": false,
    "env_vars": [],
    "outputs": [
      "outputs/us_counties_clean.csv",
      "outputs/us_counties_clean_meta.json",
      "outputs/us_counties_clean.feather",
      "outputs/dates.feather"
    ],
    "invalidNodeError": null,
    "ui_data": {
      "label": "clean_us_data.ipynb",
      "image": "data:image/svg+xml;utf8,%3Csvg%20opacity%3D%220.8%22%20version%3D%222.0%22%20viewBox%3D%220%200%20300%20300%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Afigma%3D%22http%3A%2F%2Fwww.figma.com%2Ffigma%2Fns%22%3E%3Ctitle%3Elogo.svg%3C%2Ftitle%3E%3Cdesc%3ECreated%20using%20Figma%200.90%3C%2Fdesc%3E%3Cg%20id%3D%22Canvas%22%20transform%3D%22translate(-1638%2C-1844)%22%20figma%3Atype%3D%22canvas%22%3E%3Cg%20id%3D%22logo%22%20style%3D%22mix-blend-mode%3Anormal%22%20figma%3Atype%3D%22group%22%3E%3Cpath%20d%3D%22m1788%201886a108.02%20108.02%200%200%200%20-104.92%2082.828%20114.07%2064.249%200%200%201%20104.92%20-39.053%20114.07%2064.249%200%200%201%20104.96%2039.261%20108.02%20108.02%200%200%200%20-104.96%20-83.037zm-104.96%20133.01a108.02%20108.02%200%200%200%20104.96%2083.037%20108.02%20108.02%200%200%200%20104.92%20-82.828%20114.07%2064.249%200%200%201%20-104.92%2039.053%20114.07%2064.249%200%200%201%20-104.96%20-39.261z%22%20style%3D%22fill%3A%23f57c00%3Bpaint-order%3Afill%20markers%20stroke%22%2F%3E%3Ccircle%20cx%3D%221699.5%22%20cy%3D%222110.8%22%20r%3D%2222.627%22%20style%3D%22fill%3A%239e9e9e%3Bpaint-order%3Afill%20markers%20stroke%22%2F%3E%3Ccircle%20cx%3D%221684.3%22%20cy%3D%221892.6%22%20r%3D%2216.617%22%20style%3D%22fill%3A%23616161%3Bmix-blend-mode%3Anormal%3Bpaint-order%3Afill%20markers%20stroke%22%2F%3E%3Ccircle%20cx%3D%221879.8%22%20cy%3D%221877.4%22%20r%3D%2221.213%22%20style%3D%22fill%3A%23757575%3Bmix-blend-mode%3Anormal%3Bpaint-order%3Afill%20markers%20stroke%22%2F%3E%3C%2Fg%3E%3C%2Fg%3E%3C%2Fsvg%3E%0D%0A",
      "x_pos": 290,
      "y_pos": 300,
      "description": "Notebook file"
    }
  },
  "inputs": [
    {
      "id": "inPort",
      "app_data": {
        "ui_data": {
          "cardinality": {
            "min": 0,
            "max": 1
          },
          "label": "Input Port"
        }
      },
      "links": [
        {
          "id": "92395b96-c429-423e-a2d0-1c7bc8b6dfd1",
          "node_id_ref": "58f6dde7-bd2b-42f4-b681-e2176b47e0cd",
          "port_id_ref": "outPort"
        }
      ]
    }
  ],
  "outputs": [
    {
      "id": "outPort",
      "app_data": {
        "ui_data": {
          "cardinality": {
            "min": 0,
            "max": -1
          },
          "label": "Output Port"
        }
      }
    }
  ]
}

Breakdown:

✅ id is important
❌ type seems a bit redundant to me. pipeline-editor only acknowledges execution_node and super_node, and the two can easily be distinguished without type
✅ op is important
app_data:
- ✅ properties is important
- ui_data:
  - ❌ label is dynamically generated at runtime
  - ❌ image is dynamically generated at runtime
  - ❌ description is dynamically generated at runtime
  - 💅 x_pos is unneeded if only a text editor is being used, but the UI needs it to track position
  - 💅 y_pos is unneeded if only a text editor is being used, but the UI needs it to track position
✅ inputs: the only thing we really need here is node_id_ref; the rest can be generated
❌ outputs is never actually specified; the information can be generated
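
To make the breakdown concrete, here is a rough Python sketch of stripping a verbose node down to the fields marked ✅ (plus the 💅 position data, kept separately). The function name `reduce_node`, the `needs` and `position` keys, and the choice to treat everything in app_data other than ui_data and invalidNodeError as a property are my assumptions for illustration, not existing Elyra behavior.

```python
import json


def reduce_node(node: dict) -> dict:
    """Keep only the fields the breakdown marks as needed (hypothetical helper)."""
    app_data = node.get("app_data", {})

    # Assumption: every app_data key except UI data and the error marker is a property.
    properties = {
        key: value
        for key, value in app_data.items()
        if key not in ("ui_data", "invalidNodeError")
    }

    # From inputs, only node_id_ref is kept; port ids and link ids can be regenerated.
    needs = [
        link["node_id_ref"]
        for port in node.get("inputs", [])
        for link in port.get("links", [])
    ]

    # Position is UI-only data, so it is returned apart from the node definition.
    ui_data = app_data.get("ui_data", {})
    position = {"x_pos": ui_data.get("x_pos"), "y_pos": ui_data.get("y_pos")}

    return {
        "id": node["id"],
        "op": node["op"],
        "needs": needs,
        "properties": properties,
        "position": position,
    }


# Abbreviated copy of the node shown above.
verbose_node = {
    "id": "130d1e75-9585-480b-8c98-c86aaf051159",
    "type": "execution_node",
    "op": "execute-notebook-node",
    "app_data": {
        "filename": "../notebooks/clean_us_data.ipynb",
        "runtime_image": "continuumio/anaconda3:2020.07",
        "invalidNodeError": None,
        "ui_data": {"label": "clean_us_data.ipynb", "x_pos": 290, "y_pos": 300},
    },
    "inputs": [
        {
            "id": "inPort",
            "links": [
                {
                    "id": "92395b96-c429-423e-a2d0-1c7bc8b6dfd1",
                    "node_id_ref": "58f6dde7-bd2b-42f4-b681-e2176b47e0cd",
                    "port_id_ref": "outPort",
                }
            ],
        }
    ],
    "outputs": [{"id": "outPort"}],
}

print(json.dumps(reduce_node(verbose_node), indent=2))
```

The label, image, and description fields are simply dropped here, on the premise from the breakdown that they are regenerated at runtime.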

With this in mind, I think we can reduce a node to a simple YAML definition:

clean-data: # the node id
  op: execute-notebook-node
  needs: etl-data # link information
  properties:
    file: ../notebooks/clean_us_data.ipynb
    image: continuumio/anaconda3:2020.07
    dependencies:
      - inputs/co-est2019-alldata.csv
      - util.py
    outputs:
      - outputs/us_counties_clean.csv
      - outputs/us_counties_clean_meta.json
      - outputs/us_counties_clean.feather
      - outputs/dates.feather
# x_pos, y_pos should be specified elsewhere to keep the node definition clean of any ui-only data
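
Going the other way, the compact definition should carry enough information to regenerate the verbose node for the UI. Below is a minimal Python sketch of that expansion, working on an already-parsed dict so it needs no YAML library. `expand_node` and `PROPERTY_ALIASES` are hypothetical names; the file → filename and image → runtime_image mapping is only inferred from comparing the two snippets above, and the port ids, labels, and cardinalities are copied from the example node rather than from any spec.

```python
import uuid

# Assumption: short property names map back to the verbose keys like this.
PROPERTY_ALIASES = {"file": "filename", "image": "runtime_image"}


def make_port(port_id: str, label: str, max_links: int) -> dict:
    """Build the generated port boilerplate seen in the verbose node."""
    return {
        "id": port_id,
        "app_data": {
            "ui_data": {
                "cardinality": {"min": 0, "max": max_links},
                "label": label,
            }
        },
    }


def expand_node(node_id: str, compact: dict) -> dict:
    """Expand a compact node definition into the verbose shape (hypothetical helper)."""
    needs = compact.get("needs", [])
    if isinstance(needs, str):  # "needs: etl-data" and "needs: [a, b]" are both allowed
        needs = [needs]

    in_port = make_port("inPort", "Input Port", 1)
    in_port["links"] = [
        # Link ids carry no meaning of their own, so a fresh UUID is generated.
        {"id": str(uuid.uuid4()), "node_id_ref": upstream, "port_id_ref": "outPort"}
        for upstream in needs
    ]

    app_data = {
        PROPERTY_ALIASES.get(key, key): value
        for key, value in compact.get("properties", {}).items()
    }

    return {
        "id": node_id,
        "type": "execution_node",
        "op": compact["op"],
        "app_data": app_data,
        "inputs": [in_port],
        "outputs": [make_port("outPort", "Output Port", -1)],
    }


clean_data = {
    "op": "execute-notebook-node",
    "needs": "etl-data",
    "properties": {
        "file": "../notebooks/clean_us_data.ipynb",
        "image": "continuumio/anaconda3:2020.07",
        "dependencies": ["inputs/co-est2019-alldata.csv", "util.py"],
    },
}

expanded = expand_node("clean-data", clean_data)
print(expanded["inputs"][0]["links"])
```

The ui_data block (label, image, description, position) is deliberately left out of the sketch, since the premise is that the UI regenerates it.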

If we repeat this for the rest of the nodes and add the concept of defaults, we could end up with something like this:

defaults:
  properties:
    image: continuumio/anaconda3:2020.07
    dependencies:
      - util.py

nodes:
  # Extract and transform
  etl-data:
    op: execute-notebook-node
    properties:
      file: ../notebooks/etl_us_data.ipynb
      outputs:
        - outputs/us_counties.csv
        - outputs/us_counties_meta.json

  etl-census:
    op: execute-notebook-node
    properties:
      file: ../notebooks/etl_us_census.ipynb
      dependencies:
        # should this completely override or append to the "defaults"?
        - inputs/ACS*
      outputs:
        - outputs/us_counties_income.csv

  # Clean
  clean-data:
    op: execute-notebook-node
    needs: etl-data
    properties:
      file: ../notebooks/clean_us_data.ipynb
      dependencies:
        - inputs/co-est2019-alldata.csv
      outputs:
        - outputs/us_counties_clean.csv
        - outputs/us_counties_clean_meta.json
        - outputs/us_counties_clean.feather
        - outputs/dates.feather

  # Model
  fit-data:
    op: execute-notebook-node
    needs: clean-data
    properties:
      file: ../notebooks/fit_us_data.ipynb
      outputs:
        - outputs/us_counties_curves.csv
        - outputs/us_counties_curves_meta.json
        - outputs/us_counties_curves_params.csv

  # Analyze
  demographics-data:
    op: execute-notebook-node
    needs: [etl-census, clean-data]
    properties:
      file: ../notebooks/demographics_us_data.ipynb

  analyze-fit-data:
    op: execute-notebook-node
    needs: fit-data
    properties:
      file: ../notebooks/analyze_fit_us_data.ipynb

  maps-data:
    op: execute-notebook-node
    needs: fit-data
    properties:
      file: ../notebooks/maps_us_data.ipynb

  tables-data:
    op: execute-notebook-node
    needs: fit-data
    properties:
      file: ../notebooks/tables_us_data.ipynb

  timeseries-data:
    op: execute-notebook-node
    needs: fit-data
    properties:
      file: ../notebooks/timeseries_us_data.ipynb

# is comment information used in runtime submission at all?
comments: 
  extract:
    message: Extract and transform
    for: [etl-data, etl-census]
  clean:
    message: Clean
    for: clean-data
  model:
    message: Model
    for: fit-data
  analyze:
    message: Analyze
    for: [demographics-data, analyze-fit-data, maps-data, tables-data, timeseries-data]

# throwaway ui data for positioning, if discarded the ui will try to find an 
# okay place to put the node and re-generate this information.
ui-hints: 
  nodes:
    etl-data: { x-pos: 48, y-pos: 300 }
    etl-census: { x-pos: 51, y-pos: 59 }
    clean-data: { x-pos: 290, y-pos: 300 }
    fit-data: { x-pos: 625, y-pos: 169 }
    demographics-data: { x-pos: 681, y-pos: 55 }
    analyze-fit-data: { x-pos: 914, y-pos: 122 }
    maps-data: { x-pos: 681, y-pos: 297 }
    tables-data: { x-pos: 681, y-pos: 416 }
    timeseries-data: { x-pos: 681, y-pos: 536 }
  comments: 
    extract: { x-pos: 55, y-pos: 186, width: 158, height: 30 }
    clean: { x-pos: 339, y-pos: 188, width: 61, height: 28 }
    model: { x-pos: 679, y-pos: 229, width: 62, height: 30 }
    analyze: { x-pos: 913, y-pos: 248, width: 69, height: 29 }

If you drop the auto-generated ui-hints, comments, and newlines, this is only 67 LOC (an entire pipeline in about the size of a single node's JSON).
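
Two pieces of the format above need a bit of logic when the file is loaded: applying defaults and resolving needs. Here is a hedged Python sketch of both, using a parsed subset of the pipeline as plain dicts. `merge_properties`, `as_list`, and `execution_order` are made-up names, and the `append_lists` flag exists only to show both answers to the open "override or append" question; I'm not claiming either is the right default. The ordering relies on `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def merge_properties(defaults: dict, props: dict, append_lists: bool = True) -> dict:
    """Combine default properties with a node's own properties."""
    merged = dict(defaults)
    for key, value in props.items():
        both_lists = isinstance(value, list) and isinstance(merged.get(key), list)
        if append_lists and both_lists:
            # Append strategy: defaults first, then node entries, skipping duplicates.
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        else:
            # Override strategy: the node's value replaces the default outright.
            merged[key] = value
    return merged


def as_list(needs) -> list:
    """Normalize 'needs', which may be missing, a single id, or a list of ids."""
    if needs is None:
        return []
    return [needs] if isinstance(needs, str) else list(needs)


def execution_order(nodes: dict) -> list:
    """Return node ids so that every node comes after the nodes it needs."""
    graph = {name: as_list(node.get("needs")) for name, node in nodes.items()}
    # static_order() raises graphlib.CycleError if the needs form a cycle.
    return list(TopologicalSorter(graph).static_order())


defaults = {
    "properties": {
        "image": "continuumio/anaconda3:2020.07",
        "dependencies": ["util.py"],
    }
}

nodes = {
    "etl-data": {"properties": {"file": "../notebooks/etl_us_data.ipynb"}},
    "etl-census": {
        "properties": {
            "file": "../notebooks/etl_us_census.ipynb",
            "dependencies": ["inputs/ACS*"],
        }
    },
    "clean-data": {"needs": "etl-data", "properties": {}},
    "fit-data": {"needs": "clean-data", "properties": {}},
    "demographics-data": {"needs": ["etl-census", "clean-data"], "properties": {}},
}

census = nodes["etl-census"]["properties"]
print(merge_properties(defaults["properties"], census, append_lists=True))
print(merge_properties(defaults["properties"], census, append_lists=False))
print(execution_order(nodes))
```

With append, etl-census keeps util.py alongside its own inputs/ACS* entry; with override, it would have to list util.py again itself.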

[RFC] Proposal for an experimental pipeline format #1560