CoxAutomotiveDataSolutions / waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Apache License 2.0
75 stars 16 forks

Allow tables in the storage layer to be marked in such a way that they will be deduplicated on compaction with no history retained #64

Closed: vavison closed this issue 5 years ago

vavison commented 5 years ago

Expected Behavior

There should be an option to not retain history for certain tables, so that they are deduplicated when compaction happens. This keeps storage size down for tables that accumulate a large number of duplicates, for example lookup tables that are extracted into the storage layer in full every time.
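To make the request concrete, here is a minimal sketch of the intended behaviour in plain Python. It is not Waimak's API: the function `compact`, the `retain_history` flag, and the column names `code` and `extracted_at` are hypothetical, and it assumes each row carries a primary key and a last-updated value so that "deduplicated" can mean "keep only the latest row per key".

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, object]


def compact(
    rows: Iterable[Row],
    primary_keys: Sequence[str],
    last_updated_col: str,
    retain_history: bool = True,
) -> List[Row]:
    """Hypothetical compaction step for one table.

    With retain_history=True every row is kept (the behaviour described
    under "Actual Behavior"). With retain_history=False only the most
    recent row per primary key survives.
    """
    if retain_history:
        return list(rows)

    latest: Dict[Tuple[object, ...], Row] = {}
    for row in rows:
        key = tuple(row[k] for k in primary_keys)
        current = latest.get(key)
        # Keep the row with the highest last-updated value; on ties the
        # later row in the input wins.
        if current is None or row[last_updated_col] >= current[last_updated_col]:
            latest[key] = row
    return list(latest.values())


# A lookup table extracted in full twice: every row is duplicated.
rows = [
    {"code": "GB", "name": "United Kingdom", "extracted_at": 1},
    {"code": "FR", "name": "France", "extracted_at": 1},
    {"code": "GB", "name": "United Kingdom", "extracted_at": 2},
    {"code": "FR", "name": "France", "extracted_at": 2},
]

with_history = compact(rows, ["code"], "extracted_at", retain_history=True)
without_history = compact(rows, ["code"], "extracted_at", retain_history=False)
print(len(with_history), len(without_history))
```

In this sketch the flag is set per call; in the storage layer it would more naturally be a per-table marker, as the issue title suggests, so that compaction can decide which tables to deduplicate without the caller repeating the choice.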

Actual Behavior

Currently no deduplication happens for any table, so storage size grows rapidly when a table contains a high number of duplicates.