Closed by alexjbush 5 years ago
Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/StorageActions.scala | 26 | 27 | 96.3% |
| Total: | 31 | 32 | 96.88% |
Files with Coverage Reduction:

| File | New Missed Lines | % |
|---|---|---|
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/StorageActions.scala | 1 | 93.75% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/AuditTableFile.scala | 1 | 96.49% |
| waimak-rdbm-ingestion/src/main/scala/com/coxautodata/waimak/rdbm/ingestion/RDBMIngestionUtils.scala | 1 | 93.75% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/Storage.scala | 1 | 87.5% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/FileStorageOps.scala | 2 | 90.77% |
| Total: | 6 | |
| Totals | |
|---|---|
| Change from base Build 422: | 0.1% |
| Covered Lines: | 1159 |
| Relevant Lines: | 1457 |
## Description
This PR introduces a generic way of calculating the number of partitions to use when generating Parquet files during a compaction in the storage layer.

There are two implementations to choose from:

- A new function that takes into account the average size of the `Row` objects in a Dataset and splits on a maximum bytes-per-partition size. By default, the Dataset is sampled, and the average row size is calculated and extrapolated to the full Dataset. Note: this approach only calculates the size of the `Row` objects in the JVM and does not take into account the compressed size of the serialized Parquet objects when they are written; however, the two sizes should be correlated.
- The existing implementation, which calculates the total number of cells per partition.
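The row-size-based approach can be sketched as below. This is a minimal illustration, not the actual Waimak API: the object and parameter names (`PartitionEstimator`, `maxBytesPerPartition`, etc.) are hypothetical, and the in-JVM row sizes are assumed to have already been sampled.

```scala
// Hypothetical sketch of the row-size-based partition calculation.
// Names are illustrative and not taken from the Waimak codebase.
object PartitionEstimator {

  /** Estimate the number of output partitions for a compaction.
    *
    * @param rowCount             total number of rows in the Dataset
    * @param avgRowSizeBytes      average in-JVM size of the sampled Row objects
    * @param maxBytesPerPartition target upper bound of (in-JVM) bytes per partition
    * @return at least 1 partition, enough to keep each under the byte limit
    */
  def estimatePartitions(rowCount: Long,
                         avgRowSizeBytes: Double,
                         maxBytesPerPartition: Long): Int = {
    // Extrapolate the sampled average row size to the whole Dataset...
    val totalBytes = rowCount * avgRowSizeBytes
    // ...then split on the maximum bytes-per-partition size,
    // rounding up and never returning fewer than one partition.
    math.max(1, math.ceil(totalBytes / maxBytesPerPartition).toInt)
  }
}
```

For example, 1,000,000 rows averaging 100 bytes each with a 10 MB cap yields 10 partitions. Since only in-JVM `Row` sizes are measured, the written Parquet files will typically be smaller than the cap after compression and encoding.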
Also fixes/changes the behaviour of the `recompactAll` flag, which now forces a recompaction regardless of whether we are in a compaction window or not.

Fixes #32
## Type of change
## How Has This Been Tested?
Unit tests; should also be tested against real data during the release branch.