AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0
44 stars 13 forks

Add generic partitioning option to ParquetStreamWriter #116

Closed: kevinwallimann closed this 4 years ago

kevinwallimann commented 4 years ago

Currently, the ParquetStreamWriter cannot write a dataframe partitioned by arbitrary columns. This PR should add that option.
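To make the request concrete, here is a minimal sketch of what a generic partitioning option could look like. It is an illustration only, not hyperdrive's implementation: the object name, the helper names, and the comma-separated configuration format are all hypothetical. The only Spark calls it relies on are `DataStreamWriter.partitionBy`, `option`, `format` and `start`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper object; none of these names come from hyperdrive.
object PartitionedParquetWrite {

  // Assumption: the partition columns arrive as a comma-separated string,
  // e.g. "country, year". Blank entries are dropped.
  def parsePartitionColumns(raw: String): Seq[String] =
    raw.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  // Starts a streaming Parquet write. If partition columns are given,
  // the output is laid out as destination/col1=value/col2=value/...
  def writePartitioned(df: DataFrame,
                       destination: String,
                       checkpointLocation: String,
                       partitionColumns: Seq[String]): StreamingQuery = {
    val writer = df.writeStream
      .format("parquet")
      .option("checkpointLocation", checkpointLocation)

    val maybePartitioned =
      if (partitionColumns.nonEmpty) writer.partitionBy(partitionColumns: _*)
      else writer

    maybePartitioned.start(destination)
  }
}
```

With an empty column list the sketch falls back to an unpartitioned write, so one writer could cover both cases.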

Tasks

Related info

The ParquetPartitioningStreamWriter writes a dataframe partitioned by the current date and an incrementing version number. This logic is very specialized: it deserves a dedicated component, but that component should have a less general name (maybe rename it in a separate PR). In fact, with this PR, the ParquetPartitioningStreamWriter could be rewritten as a transformer, since it mainly adds two columns.
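As a rough sketch of the transformer idea, the two columns could be added in a plain DataFrame-to-DataFrame function, after which a generic writer partitions by them. Everything here is an assumption made for illustration: the object name, the column names, and the rule that versions start at 1 are hypothetical, and how the existing versions are discovered is left out because the issue does not describe it.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{current_date, lit}

// Hypothetical transformer sketch; names are not taken from hyperdrive.
object DateVersionColumns {

  // Assumed column names, chosen for this example only.
  val DateColumn = "ingestion_date"
  val VersionColumn = "ingestion_version"

  // Assumption: the next version is one above the highest existing version,
  // starting at 1 when no version exists yet.
  def nextVersion(existingVersions: Seq[Int]): Int =
    if (existingVersions.isEmpty) 1 else existingVersions.max + 1

  // Adds the current date and the given version as two extra columns.
  def addDateAndVersion(df: DataFrame, version: Int): DataFrame =
    df.withColumn(DateColumn, current_date())
      .withColumn(VersionColumn, lit(version))
}
```

The transformed dataframe could then be written with a writer that partitions by `DateColumn` and `VersionColumn`, which keeps the date and version logic out of the writer itself.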