AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0

Refactor ParquetPartitioningStreamWriter to transformer #118

Closed kevinwallimann closed 4 years ago

kevinwallimann commented 4 years ago

ParquetPartitioningStreamWriter does two things: it adds two columns (a transformation) and writes the dataframe partitioned by those columns (a specialized write). With #116 the two responsibilities can be separated: ParquetStreamWriter is enhanced to support partitioned writes, so only the transformation remains for ParquetPartitioningStreamWriter.
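For illustration, the transformation that remains can be sketched as follows. This is a minimal sketch, not the actual StreamTransformer API: the object, method name, and parameters are assumptions, and the way the version number is resolved is simplified (the real transformer derives the next version from data already present at the destination, which is why step 3 of the migration below maps the destination property to the transformer).

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.lit

    object AddDateVersionSketch {
      // Hypothetical helper: stamp each record with the report date and an
      // ingestion version. In the real transformer the version is computed
      // from the destination's existing partitions rather than passed in.
      def addDateVersionColumns(df: DataFrame, reportDate: String, version: Int): DataFrame =
        df.withColumn("hyperdrive_date", lit(reportDate))
          .withColumn("hyperdrive_version", lit(version))
    }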

Tasks

How to migrate Hyperdrive-Trigger

  1. Replace

    "component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter"

    with

    "component.transformer.id.2=add.date.version", "component.transformer.class.add.date.version=za.co.absa.hyperdrive.ingestor.implementation.transformer.add.dateversion.AddDateVersionTransformer",
    "component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter"
  2. Replace

    writer.parquet.partitioning.report.date

    with

    transformer.add.date.version.report.date
  3. Replace

    "writer.parquet.destination

    with

    "transformer.add.date.version.destination=${writer.parquet.destination}", "writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version", "writer.parquet.destination

    Make sure no workflow uses ParquetPartitioningStreamWriter and partition columns at the same time.
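Putting the three replacements together, a migrated workflow configuration would contain entries along these lines (the report date and destination values are placeholders; transformer id 2 assumes the workflow's other transformer positions are unchanged):

    component.transformer.id.2=add.date.version
    component.transformer.class.add.date.version=za.co.absa.hyperdrive.ingestor.implementation.transformer.add.dateversion.AddDateVersionTransformer
    component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter
    transformer.add.date.version.report.date=<report-date>
    transformer.add.date.version.destination=${writer.parquet.destination}
    writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version
    writer.parquet.destination=<destination-path>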