Currently, it's not possible to write a dataframe partitioned by arbitrary columns. This feature should be added in this PR.
## Tasks

- Add a configuration option `writer.parquet.partitionby`. It should accept a comma-separated list of column names.
- If present, the `AbstractParquetStreamWriter` should call `.partitionBy` on the `DataStreamWriter`.
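A minimal Scala sketch of how the tasks above could fit together. The config lookup (`config.get(...)`) and the helper name `parsePartitionColumns` are assumptions for illustration, not the final API:

```scala
// Hypothetical helper for the new writer.parquet.partitionby option.
object PartitionByOption {
  // Parse a comma-separated list of column names, e.g. "year, month,day",
  // tolerating whitespace and empty entries.
  def parsePartitionColumns(raw: Option[String]): Seq[String] =
    raw.toSeq
      .flatMap(_.split(","))
      .map(_.trim)
      .filter(_.nonEmpty)
}

// Inside AbstractParquetStreamWriter, the option would only be applied
// when present (sketch; `config` and `dataStreamWriter` are assumed names):
//
//   val cols = PartitionByOption.parsePartitionColumns(config.get("writer.parquet.partitionby"))
//   val writer = if (cols.nonEmpty) dataStreamWriter.partitionBy(cols: _*)
//                else dataStreamWriter
```

Spark's `DataStreamWriter.partitionBy` takes varargs column names, so splitting the comma-separated config value into a `Seq[String]` and splatting it with `: _*` is all the glue that is needed.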
## Related info

The `ParquetPartitioningStreamWriter` writes a dataframe partitioned by the current date and with an incrementing version number. However, this is very specialized logic: it deserves a dedicated component, but with a less general name (maybe rename it in a separate PR). In fact, with this PR, the `ParquetPartitioningStreamWriter` could be rewritten as a transformer, since it mainly adds two columns.