Closed by alexjbush 5 years ago
Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/StorageActions.scala | 26 | 27 | 96.3% |
| Total: | 31 | 32 | 96.88% |
Files with Coverage Reduction:

| File | New Missed Lines | % |
|---|---|---|
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/StorageActions.scala | 1 | 93.75% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/AuditTableFile.scala | 1 | 96.49% |
| waimak-rdbm-ingestion/src/main/scala/com/coxautodata/waimak/rdbm/ingestion/RDBMIngestionUtils.scala | 1 | 93.75% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/Storage.scala | 1 | 87.5% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/FileStorageOps.scala | 2 | 90.77% |
| Total: | 6 | |
| Totals | |
|---|---|
| Change from base Build 422: | 0.1% |
| Covered Lines: | 1159 |
| Relevant Lines: | 1457 |
## Description
This PR introduces a generic way of calculating the number of partitions to use when generating Parquet files during a compaction in the storage layer.

There are two implementations to choose from:

- A new function that takes into account the average size of the `Row` objects in a Dataset and splits on a maximum bytes-per-partition size. By default, the Dataset is sampled, and the average row size is calculated and extrapolated to the full Dataset. Note: this approach only calculates the size of the `Row` objects in the JVM and does not take into account the compressed size of the serialized Parquet objects when they are written; however, the two sizes should be correlated.
- The existing implementation, which calculates the total number of cells per partition.
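The row-size-based approach can be sketched as below. This is a minimal illustration, not the actual Waimak API: the object and parameter names (`PartitionEstimator`, `maxBytesPerPartition`, etc.) are hypothetical, and the in-JVM row sizes are assumed to have already been sampled.

```scala
// Hypothetical sketch of the row-size-based partition calculation.
// Names are illustrative and not taken from the Waimak codebase.
object PartitionEstimator {

  /** Estimate the number of output partitions for a compaction.
    *
    * @param rowCount             total number of rows in the Dataset
    * @param avgRowSizeBytes      average in-JVM size of the sampled Row objects
    * @param maxBytesPerPartition target upper bound of (in-JVM) bytes per partition
    * @return at least 1 partition, enough to keep each under the byte limit
    */
  def estimatePartitions(rowCount: Long,
                         avgRowSizeBytes: Double,
                         maxBytesPerPartition: Long): Int = {
    // Extrapolate the sampled average row size to the whole Dataset...
    val totalBytes = rowCount * avgRowSizeBytes
    // ...then split on the maximum bytes-per-partition size,
    // rounding up and never returning fewer than one partition.
    math.max(1, math.ceil(totalBytes / maxBytesPerPartition).toInt)
  }
}
```

For example, 1,000,000 rows averaging 100 bytes each with a 10 MB cap yields 10 partitions. Since only in-JVM `Row` sizes are measured, the written Parquet files will typically be smaller than the cap after compression and encoding.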
Also fixes/changes the behaviour of the `recompactAll` flag, which now forces a recompaction regardless of whether we are in a compaction window or not.

Fixes #32
## Type of change
## How Has This Been Tested?
Unit tests; should also be tested against real data during the release branch.