FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0

PR for issue/35 #64

Closed FRosner closed 8 years ago

FRosner commented 8 years ago

Rebased version of https://github.com/FRosner/drunken-data-quality/pull/54

codecov-io commented 8 years ago

Current coverage is 100.00%

Merging #64 into master will not affect coverage as of c0e7d9b

@@            master     #64   diff @@
======================================
  Files            4       4       
  Stmts          200     205     +5
  Branches        40      41     +1
  Methods          0       0       
======================================
+ Hit            200     205     +5
  Partial          0       0       
  Missed           0       0       


ghost commented 8 years ago

@FRosner OK, so the merge rate tells me the percentage of distinct primary keys that "survive" the join with another table. I searched again for the term 'merge rate' in the context of joins/SQL but could not find an official definition. So I think we can leave it as it is and document it properly.
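To make the definition above concrete, here is a minimal sketch of the merge rate as described in this comment: the share of distinct join keys in the base table that still appear after an inner join with a reference table. It uses plain Scala collections instead of Spark DataFrames, and the names `MergeRateSketch`, `MergeRate`, and `mergeRate` are hypothetical, not part of drunken-data-quality.

```scala
// Hypothetical illustration only; not the library's implementation.
object MergeRateSketch {

  // Distinct key counts before and after the (inner) join.
  final case class MergeRate(distinctBefore: Int, distinctAfter: Int) {
    // Fraction of distinct base keys that found a partner in the reference table.
    def matchedRatio: Double =
      if (distinctBefore == 0) 0.0 else distinctAfter.toDouble / distinctBefore
  }

  // An inner join keeps exactly those base keys that also occur in `ref`,
  // so the surviving distinct keys are the intersection of both key sets.
  def mergeRate[K](base: Seq[K], ref: Seq[K]): MergeRate = {
    val baseKeys = base.toSet
    val refKeys  = ref.toSet
    MergeRate(baseKeys.size, baseKeys.intersect(refKeys).size)
  }
}
```

For example, with base keys `a, a, b, c, d` and reference keys `a, d, x`, four distinct keys go in and two survive, so the merge rate under this definition is 0.5.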

FRosner commented 8 years ago

Can you please review the PR one more time @bkomboz @zchen?

ghost commented 8 years ago

@FRosner Overall it looks good to me. One minor detail: shouldn't the variable unmatchedKeysPercentage be named matchedKeysPercentage instead? Consider the following test:

"A joinable check" should "succeed if a join on the given column yields at least one row" in {
  val base = makeIntegerDf(List(1, 1, 1, 2, 2, 3))
  val ref = makeIntegerDf(List(1, 2, 5))
  val check = Check(base).isJoinableWith(ref, "column" -> "column")
  val constraint = check.constraints.head
  val result = ConstraintSuccess("Column column->column can be used for joining (" +
    "join columns cardinality in base table: 3, " +
    "join columns cardinality after joining: 2 (67%)")
  check.run().constraintResults shouldBe Map(constraint -> result)
}

It reports 67% (the percentage of matched keys), not 33% (the percentage of unmatched keys).
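The arithmetic behind the two candidate names can be sketched as follows. This is only an illustration, assuming the percentage is rounded to a whole number as the test message suggests; the object `KeyPercentages` and both function signatures are hypothetical, and only the names matchedKeysPercentage and unmatchedKeysPercentage come from this thread.

```scala
// Hypothetical illustration only; not the library's implementation.
object KeyPercentages {

  // Percentage of distinct base keys that survive the join, rounded to an integer.
  def matchedKeysPercentage(distinctBefore: Long, distinctAfter: Long): Long = {
    require(distinctBefore > 0, "base table must contain at least one key")
    math.round(distinctAfter * 100.0 / distinctBefore)
  }

  // The complement: percentage of distinct base keys without a join partner.
  def unmatchedKeysPercentage(distinctBefore: Long, distinctAfter: Long): Long =
    100L - matchedKeysPercentage(distinctBefore, distinctAfter)
}
```

With the cardinalities from the test (3 distinct keys before the join, 2 after), the matched percentage is 67 and the unmatched percentage is 33, so the value printed in the message corresponds to the matched keys.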

FRosner commented 8 years ago

You're actually right. Thanks!

In reply to Basil Komboz's comment of 09 Feb 2016, 13:04 (https://github.com/FRosner/drunken-data-quality/pull/64#issuecomment-181838225).