CoxAutomotiveDataSolutions / waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Apache License 2.0

Produce consistently-sized, width-independent files when performing storage layer compaction #39

Closed: vavison closed this 5 years ago

vavison commented 5 years ago

Description

Storage layer compaction now sizes its output files using the width of the data as well as the number of rows. Very wide tables no longer produce very large files; their file sizes are consistent with those of narrower tables.
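
As a rough illustration of the idea (not the code in this PR), the sketch below derives a file count from both the row count and an estimated row width. The object name `CompactionSizing`, the function `numOutputFiles`, and the 128 MiB target are all hypothetical; how Waimak actually estimates width is not stated in this thread.

```scala
object CompactionSizing {

  // Assumed target size for each compacted file (128 MiB); illustrative only.
  val TargetBytesPerFile: Long = 128L * 1024 * 1024

  /** Number of output files for a table with `rowCount` rows whose rows are
    * estimated to take `bytesPerRow` bytes each.
    *
    * Uses ceiling division so that no file exceeds the target size, and
    * always returns at least one file. Overflow of rowCount * bytesPerRow
    * is ignored in this sketch.
    */
  def numOutputFiles(rowCount: Long,
                     bytesPerRow: Long,
                     targetBytesPerFile: Long = TargetBytesPerFile): Int = {
    require(rowCount >= 0, "rowCount must be non-negative")
    require(bytesPerRow > 0, "bytesPerRow must be positive")
    require(targetBytesPerFile > 0, "targetBytesPerFile must be positive")

    val totalBytes = rowCount * bytesPerRow
    val files = (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile
    math.max(1L, files).toInt
  }
}
```

With one million rows, a narrow table at an assumed 16 bytes per row fits in a single file, while a wide table at an assumed 1600 bytes per row is split into 12 files of roughly the same size. In Spark, a count like this would typically be passed to `repartition` or `coalesce` before writing.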

Fixes the underlying issue behind #32. However, this change does not make region sizes independent of the width of the data: that would require adding schema information to AuditTableRegionInfo, which is a much bigger change for comparatively little gain.

How Has This Been Tested?

Added unit tests verifying that the number of output files produced is determined by both the width and the length of the data.

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 328


| Files with Coverage Reduction | New Missed Lines | % |
| :--- | ---: | ---: |
| waimak-hive/src/main/scala/com/coxautodata/waimak/metastore/HiveDBConnector.scala | 1 | 96.43% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/AuditTableFile.scala | 1 | 96.13% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/StorageActions.scala | 1 | 85.11% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/FileStorageOps.scala | 1 | 93.65% |
| waimak-rdbm-ingestion/src/main/scala/com/coxautodata/waimak/rdbm/ingestion/RDBMIngestionActions.scala | 1 | 94.59% |

| Totals | |
| :--- | ---: |
| Change from base Build 325 | -0.04% |
| Covered Lines | 1082 |
| Relevant Lines | 1371 |

💛 - Coveralls