CoxAutomotiveDataSolutions / waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Apache License 2.0
75 stars 16 forks

Feature/autoignore bad info cache #43

Closed alexjbush closed 5 years ago

alexjbush commented 5 years ago

Description

This includes a check of the cache information for a given table. If the regions inferred from the cache do not match the partition folders found on the filesystem, then all cache information for that table is ignored and the table is read directly from the Parquet files.

This is to prevent the cache from becoming invalid in the case of a failure between writing out the table data and writing the new cache information.
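For reviewers, the check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code in this PR: the function names, the `region=<id>` folder layout, and the shape of the cache dictionary are all assumptions made for the example.

```python
from pathlib import Path
from typing import Dict, Set


def partition_folders(table_path: Path, partition_column: str) -> Set[str]:
    """Return the names of partition folders (e.g. 'region=a') under a table path."""
    prefix = f"{partition_column}="
    return {
        p.name
        for p in table_path.iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    }


def usable_cache(
    base_path: Path,
    cached_regions: Dict[str, Set[str]],
    partition_column: str = "region",  # hypothetical partition column name
) -> Dict[str, Set[str]]:
    """Keep cache entries only for tables whose cached regions match the folders on disk.

    Tables that fail the check are dropped from the result, so a caller would
    fall back to reading those tables directly from the Parquet files.
    """
    valid: Dict[str, Set[str]] = {}
    for table, regions in cached_regions.items():
        # Folder names the cache implies should exist for this table.
        expected = {f"{partition_column}={region}" for region in regions}
        table_path = base_path / table
        # A missing table folder is treated as having no partitions.
        found = (
            partition_folders(table_path, partition_column)
            if table_path.is_dir()
            else set()
        )
        if expected == found:
            valid[table] = regions
    return valid
```

In this sketch the comparison is all-or-nothing per table: any mismatch discards every cached region for that table, while tables whose cache matches the filesystem keep their cache entries.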

Type of change

Please delete options that are not relevant.

How Has This Been Tested?

Unit tests. I have tested the cases of:

In both cases I kept one valid table without bad cache information to ensure good tables don't lose the cache optimisation.