DataCater / datacater

The developer-friendly ETL platform for transforming data in real time. Based on Apache Kafka® and Kubernetes®.
https://datacater.io

Proposal for a more flexible pipeline specification #14

Closed · flippingbits closed this 1 year ago

flippingbits commented 1 year ago

At the moment, our pipeline spec is very opinionated and limited with regard to defining filters and transforms. It is not flexible enough to support general-purpose streaming data pipelines.

The main limitations of our current spec are:

- Filters and transforms can be applied only to single attributes of a record's value; there is no way to operate on entire records (key, value, and metadata).
- There is no support for plugging user-defined transformation code into a pipeline.

The current pipeline spec looks as follows (example):

---
spec:
  filters: []
  transformationSteps:
  - name: First
    transformations:
    - attributeName: name
      transformation: trim
    - attributeName: email
      transformation: trim
      filter: not-empty
    - attributeName: age
      transformation: add-column
      transformationConfig:
        attributeName: age
        defaultValue: 42
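
To make the field-level restriction concrete, here is a minimal sketch of how such a step could be interpreted. The function names and the plain-dict representation of the record's value are illustrative assumptions, not DataCater's actual engine code:

def passes_filter(filter_key, value):
    # "not-empty" is the only filter used in the example above.
    if filter_key == "not-empty":
        return value is not None and value != ""
    return True

def apply_transformation_step(step, value):
    # Every operation is bound to exactly one attribute of the record's
    # value; there is no hook for key- or metadata-level logic.
    for t in step["transformations"]:
        attribute = t["attributeName"]
        # One plausible reading of the per-transformation filter: it
        # gates whether the transform runs for this record.
        if "filter" in t and not passes_filter(t["filter"], value.get(attribute)):
            continue
        if t["transformation"] == "trim":
            value[attribute] = (value.get(attribute) or "").strip()
        elif t["transformation"] == "add-column":
            config = t["transformationConfig"]
            value[config["attributeName"]] = config.get("defaultValue")
    return value

For instance, apply_transformation_step(step, {"name": " Jane ", "email": ""}) would trim name, skip the trim on email (its not-empty filter fails), and add age with the default value 42.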

I propose changing our pipeline spec as follows:

- Replace the separate filters and transformationSteps with a single, ordered list of steps.
- Give each step a kind: Field steps work on single fields of the record's value, while Record steps work on entire records.
- Reference filters and transforms by a key and configure them through a generic config object.
- Support user-defined code for record-level transformations.

Speaking YAML, the spec could look as follows:

---
spec:
  steps:
  - kind: Field
    fields:
      name:
        transform:
          key: trim
      email:
        transform:
          key: hash
          config:
            algorithm: sha1
        filter:
          key: not-empty
  - kind: Record
    transform: user-defined-record-transformation
    config:
      code: |
        def transform(record):
          return record

Filters and transforms are referenced by their key and can be configured through the config object. Field-level steps apply filters/transforms to single fields of the record's value, while Record-level steps operate on entire records, consisting of a key, a value, and metadata.
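
For illustration, a user-defined record-level transform could then look like the sketch below. The dict-based record shape ("key", "value", "metadata") is an assumption made for this example; the proposal itself only fixes the transform(record) contract:

def transform(record):
    # Record-level steps see the whole record: key, value, and metadata.
    if record.get("value") is None:
        # Pass tombstone records through unchanged.
        return record

    # Derive a value field from the record's key ...
    record["value"]["user_id"] = record["key"]
    # ... and annotate the metadata.
    record["metadata"]["processed"] = True
    return record

Something like this would be impossible to express with the current, purely field-level spec.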