CoxAutomotiveDataSolutions / waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Apache License 2.0
75 stars, 16 forks

Allow repartition by non-named partition for ParquetDataCommiter #53

Closed: alexjbush closed this 5 years ago

alexjbush commented 5 years ago

Expected Behavior

DataFrames can be repartitioned to a given number of partitions before being written out to a FileSystem when using the commit blocks.

Actual Behavior

If a DataFrame is repartitioned before it is passed to the commit action, the repartition is sometimes ignored, because the DataFrame can be cached as Parquet if its label is reused in the flow. Currently the ParquetDataCommiter API only allows repartitioning by named partitions, not by a number of partitions.
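
For reference, the difference between the two kinds of repartitioning can be sketched in plain Spark. This is only an illustration of the underlying Spark calls, not of the Waimak commit API: the object name, helper names, column names, and sample data are all hypothetical, and it assumes Spark SQL is on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RepartitionSketch {

  // Repartition by a number: fixes the partition count, no column involved.
  def byNumber(df: DataFrame, n: Int): DataFrame = df.repartition(n)

  // Repartition by a named column: rows with equal values in that column
  // are shuffled into the same partition.
  def byColumn(df: DataFrame, name: String): DataFrame = df.repartition(col(name))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, for illustration only.
    val df = Seq((1, "a"), (2, "b"), (3, "a"), (4, "c")).toDF("id", "group")

    val numbered = byNumber(df, 3)
    assert(numbered.rdd.getNumPartitions == 3)

    // Writing the numbered DataFrame directly keeps the requested layout,
    // so the output directory holds at most 3 part files.
    val out = Files.createTempDirectory("repartition-sketch").resolve("by-number").toString
    numbered.write.mode("overwrite").parquet(out)

    // The named-column variant writes one sub-directory per distinct value.
    val outByGroup = Files.createTempDirectory("repartition-sketch").resolve("by-group").toString
    byColumn(df, "group").write.mode("overwrite").partitionBy("group").parquet(outByGroup)

    spark.stop()
  }
}
```

The point of the sketch is that a numeric repartition only survives if the repartitioned DataFrame is the one that actually gets written. If an intermediate step writes the data out and reads it back, as described above for cached labels, the partition count requested upstream no longer applies, which is why the request is for the committer itself to accept a partition number.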