FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0

Custom constraints on dataframes #128

Closed FRosner closed 7 years ago

FRosner commented 7 years ago

Problem

If you need a constraint that DDQ does not implement yet, please open a new issue at DDQ. Until that constraint is implemented and released, you should be able to define it yourself as a custom constraint.

Draft

There should be a custom constraint function that takes two parameters: the constraint name, and a function that operates on the dataframe and returns either a success message or a failure message.

Check(customers)
  .hasNumRows(_ >= 3)
  .hasUniqueKey("id")
  .custom(
    "number of columns = 100",
    // Right carries the success message, Left carries the failure message
    (df: DataFrame) =>
      if (df.columns.length == 100) Right("number of columns match")
      else Left(s"number of columns was ${df.columns.length}")
  ).run()
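
To make the success/failure contract from the draft easier to try out, here is a minimal sketch that runs without Spark or DDQ on the classpath. `Table`, `CustomConstraint`, and `evaluate` are hypothetical stand-ins invented for this illustration and are not part of DDQ. The only thing taken from the draft is the shape: a constraint name plus a function that returns `Right` with a success message or `Left` with a failure message.

```scala
object CustomConstraintSketch {
  // Hypothetical stand-in for a Spark DataFrame; only the column names matter here.
  final case class Table(columns: Seq[String])

  // Hypothetical model of a custom constraint: a name plus a function that
  // returns Right(successMessage) or Left(failureMessage), as in the draft.
  final case class CustomConstraint(
    name: String,
    fn: Table => Either[String, String]
  )

  // Evaluates one constraint against a table and renders a one-line report.
  def evaluate(table: Table, constraint: CustomConstraint): String =
    constraint.fn(table) match {
      case Right(msg) => s"SUCCESS [${constraint.name}]: $msg"
      case Left(msg)  => s"FAILURE [${constraint.name}]: $msg"
    }

  // Example constraint mirroring the column-count check from the draft,
  // scaled down to 3 columns so it is easy to exercise.
  val hasThreeColumns: CustomConstraint = CustomConstraint(
    "number of columns = 3",
    t =>
      if (t.columns.length == 3) Right("number of columns match")
      else Left(s"number of columns was ${t.columns.length}")
  )
}
```

Calling `CustomConstraintSketch.evaluate` with a three-column `Table` takes the `Right` branch, and any other column count takes the `Left` branch. Returning `Either[String, String]` keeps the user-supplied function free of reporting concerns, since the caller decides how to render each outcome.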

Documentation

https://github.com/FRosner/drunken-data-quality/wiki/Drunken-Data-Quality-4.1.0#custom-constraints