CoxAutomotiveDataSolutions / waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Apache License 2.0

Feature/optionally remove storage history on compaction #66

Closed vavison closed 5 years ago

vavison commented 5 years ago

Description

Adds an option to discard storage history for specific tables, so that those tables are deduplicated when compaction runs. This keeps storage size down where a table may contain a large number of duplicates, for example lookup tables that are extracted into the storage layer in full every time.
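To illustrate the intended effect, here is a minimal sketch in plain Scala collections rather than Spark. The `LookupRow` type, the `compact` helper, and the rule "keep the most recently extracted row per primary key" are assumptions made for illustration; they are not Waimak's actual compaction code.

```scala
// Hypothetical record type for illustration only; not part of Waimak.
final case class LookupRow(id: Int, value: String, extractedAt: Long)

object CompactionSketch {
  // If history is retained, compaction leaves every row in place.
  // Otherwise only the most recently extracted row per primary key survives
  // (assumed deduplication rule), returned in primary-key order.
  def compact(rows: Seq[LookupRow], retainHistory: Boolean): Seq[LookupRow] =
    if (retainHistory) rows
    else
      rows
        .groupBy(_.id)
        .values
        .map(_.maxBy(_.extractedAt))
        .toSeq
        .sortBy(_.id)
}
```

With a lookup table that is re-extracted in full on every run, the non-retaining path collapses the repeated copies of each key to a single row, which is where the storage saving comes from.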

For RDBM extraction, the default behaviour is to retain history for tables that have a lastUpdatedColumn and not to retain it for those without. This default was chosen because a table without a lastUpdatedColumn is extracted in full every time extraction is performed, so the data in storage would otherwise grow without bound. To override this behaviour, set forceRetainStorageHistory on the RDBMExtractionTableConfig for each table.
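The default-plus-override rule described above can be sketched as follows. `TableConfigSketch` is a hypothetical stand-in, not the real RDBMExtractionTableConfig; modelling forceRetainStorageHistory as an `Option[Boolean]` is an assumption, and only the field names are taken from this description.

```scala
// Hypothetical stand-in for a per-table extraction config.
// Field names follow the PR description; the types are assumed.
final case class TableConfigSketch(
  tableName: String,
  lastUpdatedColumn: Option[String] = None,
  forceRetainStorageHistory: Option[Boolean] = None
) {
  // An explicit override wins; otherwise history is retained
  // only when the table has a last-updated column.
  def retainStorageHistory: Boolean =
    forceRetainStorageHistory.getOrElse(lastUpdatedColumn.isDefined)
}
```

Under these assumptions, `TableConfigSketch("lookup")` would not retain history, while `TableConfigSketch("lookup", forceRetainStorageHistory = Some(true))` would, even though it is still extracted in full each time.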

Fixes #64

Type of change


This is mostly non-breaking; however, an additional, non-optional field has been added to AuditTableInfo, so code that instantiates it directly will break.

How Has This Been Tested?

Updated existing unit tests and added new ones.

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 470


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| waimak-core/src/main/scala/com/coxautodata/waimak/filesystem/FSUtils.scala | 0 | 4 | 0.0% |
| Total: | 25 | 29 | 86.21% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/StorageActions.scala | 1 | 93.75% |
| waimak-impala/src/main/scala/com/coxautodata/waimak/metastore/ImpalaDBConnector.scala | 1 | 60.0% |
| waimak-core/src/main/scala/com/coxautodata/waimak/filesystem/FSUtils.scala | 1 | 22.81% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/FileStorageOps.scala | 1 | 92.54% |
| waimak-core/src/main/scala/com/coxautodata/waimak/dataflow/spark/SparkDataFlow.scala | 1 | 83.87% |
| waimak-core/src/main/scala/com/coxautodata/waimak/log/Logging.scala | 1 | 30.77% |
| waimak-rdbm-ingestion/src/main/scala/com/coxautodata/waimak/rdbm/ingestion/RDBMIngestionActions.scala | 1 | 96.15% |
| waimak-storage/src/main/scala/com/coxautodata/waimak/storage/AuditTableFile.scala | 2 | 95.32% |
| waimak-core/src/main/scala/com/coxautodata/waimak/dataflow/DataFlow.scala | 2 | 94.03% |
| waimak-core/src/main/scala/com/coxautodata/waimak/dataflow/spark/SparkActions.scala | 3 | 69.44% |
| Total: | 14 | |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 461: | -0.03% |
| Covered Lines: | 1190 |
| Relevant Lines: | 1488 |

💛 - Coveralls