FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0
222 stars 69 forks

PyDDQ ZeppelinReporter #106

Closed: Gerrrr closed this 8 years ago

Gerrrr commented 8 years ago
%pyspark
from pyddq.core import Check
from pyddq.reporters import ZeppelinReporter

df = sqlContext.createDataFrame([(1, "a"), (1, None), (3, "c")])
check = Check(df)
# z is the ZeppelinContext that Zeppelin injects into every pyspark paragraph
reporter = ZeppelinReporter(z)
check.hasUniqueKey("_1", "_2").isNeverNull("_1").run([reporter])

z (passed into ZeppelinReporter) is the PyZeppelinContext that Zeppelin makes available in every pyspark note. It has to be passed so that the report output is rendered in the paragraph that ran the check.
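
As background on why the paragraph context matters: Zeppelin renders any paragraph output that starts with %html as HTML in the cell that produced it, and a Zeppelin-aware reporter can rely on that to draw a formatted report right below the check. The snippet below is only a conceptual sketch of this idea; the class name, the result format, and the use of print instead of the ZeppelinContext are assumptions and not the actual pyddq ZeppelinReporter implementation.

%pyspark
# Conceptual sketch only, NOT the actual pyddq.reporters.ZeppelinReporter.
# Zeppelin interprets paragraph output that starts with "%html" as HTML.
class HtmlParagraphReporter(object):
    def report(self, results):
        # results: list of (constraint description, passed) pairs (assumed format)
        rows = "".join(
            "<tr><td>{0}</td><td>{1}</td></tr>".format(name, "success" if ok else "failure")
            for name, ok in results
        )
        print("%html <table>{0}</table>".format(rows))

HtmlParagraphReporter().report([("Column _1 is never null", True),
                                ("Columns _1, _2 form a unique key", False)])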

codecov-io commented 8 years ago

Current coverage is 100% (diff: 100%)

Merging #106 into master will not change coverage

@@           master   #106   diff @@
====================================
  Files          24     24          
  Lines         437    437          
  Methods       421    421          
  Messages        0      0          
  Branches       16     16          
====================================
  Hits          437    437          
  Misses          0      0          
  Partials        0      0          

Powered by Codecov. Last update 320fa32...ad90de5

FRosner commented 8 years ago

LGTM. Did you test whether the README.md examples still work, @Gerrrr?

Gerrrr commented 8 years ago

The examples from the README.md do not work in Zeppelin because the README deliberately demonstrates several different reporters and output destinations. Inside Zeppelin, only pyddq.reporters.ZeppelinReporter makes sense, which is what the example above uses.
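
For reference, the README examples look roughly like the sketch below: they wire several reporters (console, markdown) to explicit output streams, which is exactly what does not translate to a Zeppelin paragraph. The stream and accessor names here (FileOutputStream, ByteArrayOutputStream, get_output) are reproduced from memory and should be treated as assumptions rather than the authoritative pyddq API.

# Rough sketch of a README-style example meant for a plain pyspark shell,
# not for Zeppelin. Stream and accessor names below are assumptions.
import sys
from pyddq.core import Check
from pyddq.reporters import ConsoleReporter, MarkdownReporter
from pyddq.streams import FileOutputStream, ByteArrayOutputStream

df = sqlContext.createDataFrame([(1, "a"), (1, None), (3, "c")])

markdown_stream = ByteArrayOutputStream()
reporters = [
    ConsoleReporter(FileOutputStream(sys.stdout)),  # human-readable console report
    MarkdownReporter(markdown_stream),              # markdown report captured in memory
]
Check(df).hasUniqueKey("_1", "_2").isNeverNull("_1").run(reporters)

print(markdown_stream.get_output())  # assumed accessor for the captured markdown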