CoxAutomotiveDataSolutions / waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Apache License 2.0

Feature/data quality monitoring #92

Closed: vavison closed this 5 years ago

vavison commented 5 years ago

Description

Introduces functionality to monitor and alert on data quality for labels, using Amazon's Deequ library (https://github.com/awslabs/deequ).

A programmatic API exposes the full functionality available in Deequ (https://github.com/CoxAutomotiveDataSolutions/waimak/wiki/data-quality#deequ).

A configuration-based API also exposes some common pre-configured checks (https://github.com/CoxAutomotiveDataSolutions/waimak/wiki/Configuration-Extensions#deequ-extension).

There is also the option to not use Deequ and instead use a custom implementation of data quality checking.
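
As a rough illustration of what a custom (non-Deequ) check might involve, the sketch below computes the completeness of a column and classifies it against warning and critical thresholds. It is plain Scala with no Spark, Deequ, or Waimak dependency, and every name in it (`CompletenessSketch`, `Severity`, `completeness`, `classify`, `alertOrThrow`, and the threshold parameters) is hypothetical, not part of this PR's API.

```scala
// Hypothetical sketch: none of these names come from Waimak or Deequ.
object CompletenessSketch {

  // Outcome of comparing a metric against its thresholds.
  sealed trait Severity
  case object Ok extends Severity
  case object Warning extends Severity
  case object Critical extends Severity

  // Fraction of values that are present (Some).
  // An empty column is treated as fully complete (an assumption of this sketch).
  def completeness[A](column: Seq[Option[A]]): Double =
    if (column.isEmpty) 1.0
    else column.count(_.isDefined).toDouble / column.size

  // Classify a completeness ratio: below criticalBelow is Critical,
  // below warningBelow is Warning, otherwise Ok.
  def classify(ratio: Double, warningBelow: Double, criticalBelow: Double): Severity = {
    require(criticalBelow <= warningBelow,
      "critical threshold must not exceed warning threshold")
    if (ratio < criticalBelow) Critical
    else if (ratio < warningBelow) Warning
    else Ok
  }

  // One possible alerting policy: log warnings, fail fast on critical results.
  def alertOrThrow(label: String, severity: Severity): Unit = severity match {
    case Critical =>
      throw new IllegalStateException(s"Data quality check failed for label '$label'")
    case Warning =>
      println(s"Data quality warning for label '$label'")
    case Ok => ()
  }
}
```

For example, `CompletenessSketch.classify(CompletenessSketch.completeness(Seq(Some(1), None, Some(3), Some(4))), warningBelow = 0.9, criticalBelow = 0.6)` evaluates to `Warning`, because 3 of the 4 values are present (0.75).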

Type of change


How Has This Been Tested?

Data quality actions are all thoroughly unit tested.

codecov-io commented 5 years ago

Codecov Report

Merging #92 into develop will increase coverage by 0.88%. The diff coverage is 81.61%.


@@             Coverage Diff             @@
##           develop      #92      +/-   ##
===========================================
+ Coverage     87.2%   88.08%   +0.88%     
===========================================
  Files           55       74      +19     
  Lines         1555     1771     +216     
  Branches        62       79      +17     
===========================================
+ Hits          1356     1560     +204     
- Misses         199      211      +12

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ala/com/coxautodata/waimak/dataflow/DataFlow.scala | `97.45% <ø> (+0.84%)` | :arrow_up: |
| ...ata/waimak/dataflow/spark/SparkActionHelpers.scala | `95.83% <ø> (+3.24%)` | :arrow_up: |
| ...ark/dataquality/DataQualityMetadataExtension.scala | `100% <100%> (ø)` | |
| ...taquality/deequ/prefabchecks/GenericSQLCheck.scala | `100% <100%> (ø)` | |
| ...quality/deequ/prefabchecks/CompletenessCheck.scala | `100% <100%> (ø)` | |
| ...ataquality/DataQualityConfigurationExtension.scala | `100% <100%> (ø)` | |
| ...taflow/spark/dataquality/deequ/DeequMetadata.scala | `100% <100%> (ø)` | |
| ...flow/spark/dataquality/ExceptionQualityAlert.scala | `100% <100%> (ø)` | |
| ...a/waimak/configuration/CaseClassConfigParser.scala | `97.29% <100%> (+0.03%)` | :arrow_up: |
| ...imak/dataflow/spark/dataquality/DatasetCheck.scala | `100% <100%> (ø)` | |

... and 40 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ee88f76...382d457.