AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0
44 stars 13 forks source link

Add option to keep columns in AvroDecodingTransformer #181

Closed kevinwallimann closed 3 years ago

kevinwallimann commented 3 years ago

Background

Currently, the ConfluentAvroDecodingTransformer removes any columns other than key and value by only selecting these.

However, for #177, the offset and partition from the original dataframe from the KafkaStreamReader need to be retained across the ConfluentAvroDecodingTransformer such that the source offset and partition are written to the sink topic for each message.

Feature

An option should be added to specify columns that should be kept across the decoding. If there is a column name collision, an exception should be thrown. Column name collisions may be solved by inserting an instance of the ColumnRenamingStreamTransformer in front of it in the pipeline