awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Incremental ColumnProfiler #397

Open fzmatt opened 3 years ago

fzmatt commented 3 years ago

I'd like to use a ColumnProfiler that combines the result of a previous run with the profile of the current data.

For example we have

case class Student(name: String, surname: String, middleName: Option[String])

and two different runs (on a daily basis):

  1. The first run, yesterday, with a Student("ciccio", "pasticcio", None), which gives us a Completeness("middleName") = 0.

  2. The second run, today, with a Student("ciccio", "pasticcio", Some("the best")), which has a local Completeness("middleName") = 1. Combined with run 1, I'd like to get a Completeness of 0.5, i.e. (1+0)/2.
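To make the expected arithmetic concrete, here is a minimal plain-Scala sketch of what I mean by combining the two runs. It does not use deequ at all: `CompletenessState`, `merge` and `of` are hypothetical names I made up for illustration. The idea is that each run keeps the counts (non-null values, total rows) rather than the ratio, so that the counts of two runs can be added before dividing.

```scala
// Illustration only: hypothetical types, not deequ's API.
final case class Student(name: String, surname: String, middleName: Option[String])

// Per-run state: how many values were present and how many rows were seen.
final case class CompletenessState(nonNull: Long, total: Long) {
  // Combining two runs means adding their counts, not averaging their ratios.
  def merge(other: CompletenessState): CompletenessState =
    CompletenessState(nonNull + other.nonNull, total + other.total)

  // Returning 0.0 for an empty state is a choice made for this sketch.
  def completeness: Double =
    if (total == 0L) 0.0 else nonNull.toDouble / total
}

object CompletenessState {
  // Build the state for one run from the column values of that run.
  def of(values: Seq[Option[String]]): CompletenessState =
    CompletenessState(values.count(_.isDefined).toLong, values.size.toLong)
}

object IncrementalCompletenessSketch {
  val run1: Seq[Student] = Seq(Student("ciccio", "pasticcio", None))
  val run2: Seq[Student] = Seq(Student("ciccio", "pasticcio", Some("the best")))

  val state1: CompletenessState = CompletenessState.of(run1.map(_.middleName)) // run 1 alone
  val state2: CompletenessState = CompletenessState.of(run2.map(_.middleName)) // run 2 alone
  val merged: CompletenessState = state1.merge(state2)                         // both runs
}
```

With these two runs, `state1.completeness` is 0.0, `state2.completeness` is 1.0 and `merged.completeness` is 0.5, which is the value I would like the profiler to report after the second run.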

This is the code I'm using:

val result = ColumnProfilerRunner()
  .onData(validDf)
  .restrictToColumns(Seq("middleName"))
  .useRepository(repository)
  .reuseExistingResultsForKey(ResultKey(1636556003353L))
  .saveOrAppendResult(currentRunResultKey)
  .run()

where ResultKey(1636556003353L) is the key of the first run.

How can I achieve this result? Is it possible?

Thanks