At the moment, our pipeline spec is very opinionated and limited with regard to defining filters and transforms. It is not flexible enough to serve general-purpose streaming data pipelines.
The main limitations of our current spec are:
Pipelines can define a set of filters and a set of transforms. Transforms are always applied after filters. Users cannot chain transforms and filters in an arbitrary order.
Filters and transforms operate only on the values of Kafka records, not on the keys.
The current pipeline spec looks as follows (example):
---
spec:
  filters: []
  transformationSteps:
    - name: First
      transformations:
        - attributeName: name
          transformation: trim
        - attributeName: email
          transformation: trim
          filter: not-empty
        - attributeName: age
          transformation: add-column
          transformationConfig:
            attributeName: age
            defaultValue: 42
I propose changing our pipeline spec as follows:
Allow users to build a sequence of pipeline steps, which can filter and transform at the same time. I.e., filters can be defined at any stage of a pipeline.
Allow users to apply transforms/filters to single fields of records (this is already possible at the moment) as well as to entire records in one go (this is not yet possible). To this end, a step can be of kind: Field or kind: Record.
In YAML, the spec could look as follows:
---
spec:
  steps:
    - kind: Field
      fields:
        name:
          transform:
            key: trim
        email:
          transform:
            key: hash
            config:
              algorithm: sha1
          filter:
            key: not-empty
    - kind: Record
      transform: user-defined-record-transformation
      config:
        code: |
          def transform(record):
              return record
Filters and transforms are referenced by their key.
They can be configured with the config object. Field-level steps apply filters/transforms to single fields of the record's value, while Record-level steps operate on entire records, consisting of a key, a value, and metadata.
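To make the intended semantics more concrete, here is a minimal Python sketch of how a runner might interpret such a list of steps. Nothing in it is an existing implementation: the `Record` class, the `TRANSFORMS`, `FILTERS`, and `RECORD_TRANSFORMS` registries, the `lowercase-key` transform, and the function names are all hypothetical. It also assumes that, within a single field entry, the filter is evaluated before the transform (mirroring the current behaviour, where transforms run after filters); the proposal itself does not fix that order. Loading user-defined code from `config.code` is deliberately left out.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Record:
    """Hypothetical record shape: a key, a value (dict of fields), and metadata."""
    key: Any
    value: dict
    metadata: dict = field(default_factory=dict)


def _hash(value, config):
    # Hash the string form of the value with the configured algorithm.
    algorithm = config.get("algorithm", "sha1")
    return hashlib.new(algorithm, str(value).encode("utf-8")).hexdigest()


def _lowercase_key(record, config):
    # Record-level transforms may touch the key, not just the value.
    record.key = str(record.key).lower()
    return record


# Hypothetical registries: spec keys -> implementations.
TRANSFORMS = {"trim": lambda value, config: value.strip(), "hash": _hash}
FILTERS = {"not-empty": lambda value, config: value not in (None, "")}
RECORD_TRANSFORMS = {"lowercase-key": _lowercase_key}


def apply_field_step(step: dict, record: Record) -> Optional[Record]:
    """Apply a Field step in place; return None if a filter drops the record."""
    for name, spec in step["fields"].items():
        value = record.value.get(name)
        flt = spec.get("filter")
        # Assumption: the filter sees the value before the transform runs.
        if flt is not None and not FILTERS[flt["key"]](value, flt.get("config", {})):
            return None
        tr = spec.get("transform")
        if tr is not None and value is not None:
            record.value[name] = TRANSFORMS[tr["key"]](value, tr.get("config", {}))
    return record


def run_pipeline(steps: list, record: Record) -> Optional[Record]:
    """Run the steps in the order given; stop as soon as the record is dropped."""
    for step in steps:
        if step["kind"] == "Field":
            result = apply_field_step(step, record)
        elif step["kind"] == "Record":
            transform = RECORD_TRANSFORMS[step["transform"]]
            result = transform(record, step.get("config", {}))
        else:
            raise ValueError(f"unknown step kind: {step['kind']!r}")
        if result is None:
            return None
        record = result
    return record


# The proposed spec, written as plain Python data instead of parsed YAML.
STEPS = [
    {
        "kind": "Field",
        "fields": {
            "name": {"transform": {"key": "trim"}},
            "email": {
                "transform": {"key": "hash", "config": {"algorithm": "sha1"}},
                "filter": {"key": "not-empty"},
            },
        },
    },
    {"kind": "Record", "transform": "lowercase-key"},
]

kept = run_pipeline(STEPS, Record(key="User-1", value={"name": "  Ada ", "email": "ada@example.com"}))
dropped = run_pipeline(STEPS, Record(key="User-2", value={"name": "Bob", "email": ""}))
```

Because steps are processed strictly in list order, a filter can sit at any stage of the pipeline, and a Record step can read or rewrite the key, which addresses both limitations listed above.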