FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0

Updated to Spark 2.x #113

Closed · mlavaert closed this 7 years ago

mlavaert commented 7 years ago

I've forked your project to upgrade the Spark version to 2.x and to use SparkSession instead of SQLContext. I got everything working except the Python package: the normal tests run fine, but the integration tests fail with a ClassNotFoundException. I can't get my head around it.
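
For context, a minimal sketch of the kind of change this upgrade involves (illustrative only, not DDQ's actual code; the file path is made up):

```scala
import org.apache.spark.sql.SparkSession

// Spark 1.x style (SQLContext-based, as before the upgrade):
//   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//   val df = sqlContext.read.json("data.json")

// Spark 2.x style: SparkSession is the single entry point
val spark = SparkSession.builder()
  .appName("ddq-checks")
  .getOrCreate()

val df = spark.read.json("data.json")

// The old SQLContext is still reachable where an API requires it:
val sqlContext = spark.sqlContext
```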

codecov-io commented 7 years ago

Current coverage is 0.00% (diff: 0.00%)

Merging #113 into master will decrease coverage by 100%

@@           master   #113   diff @@
====================================
  Files          24     24          
  Lines         437    437          
  Methods       421      0   -421   
  Messages        0      0          
  Branches       16      8     -8   
====================================
- Hits          437      0   -437   
- Misses          0    437   +437   
  Partials        0      0          

Powered by Codecov. Last update 183e69b...6708004

FRosner commented 7 years ago

Sorry for not responding for so long. We will finally take the time to review and merge this, probably next week.

FRosner commented 7 years ago

I created a local branch issue/117. I'll merge it and close this one. Thanks a lot again for the contribution, @mLavaert!

Are you using it at Dataminded?

FRosner commented 7 years ago

@mLavaert I was thinking: should we change all signatures from DataFrame to Dataset[T]? I haven't used the Spark 2 APIs a lot yet, so I am not sure what makes more sense. Can I have your 2 cents?

mlavaert commented 7 years ago

Hey, I'm sorry for replying terribly late. I think it would be better to use Dataset[T] instead of DataFrame, since DataFrame is just an alias for Dataset[Row].

It would also allow people to use the framework directly with a typed Dataset[T].
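
To make that concrete, a rough sketch (the check below is hypothetical, not the actual DDQ API): because DataFrame is just `type DataFrame = Dataset[Row]` in Spark 2.x, a signature generalized to Dataset[T] still accepts plain DataFrames:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders}

// Hypothetical check, not the real DDQ API: a Dataset[T] signature
// works for typed Datasets and for untyped DataFrames (= Dataset[Row]).
def isNotEmpty[T](data: Dataset[T]): Boolean =
  data.count() > 0

// val df: DataFrame = spark.read.json("data.json")    // Dataset[Row]
// isNotEmpty(df)                                       // compiles fine
// case class Record(id: Long, name: String)
// isNotEmpty(df.as[Record](Encoders.product[Record]))  // typed Dataset too
```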

We are using DDQ at Dataminded. For one of our clients we've forked the framework to make some specific adaptations, mostly to change the layout of the Log4j logger to comply with their Logstash/Kibana standards. At most clients we use the standard version.

FRosner commented 7 years ago

Gotcha @mLavaert. I created #132.

Nice to hear that you like using it. Regarding the format, we could also implement a way to customize it and keep the current one as the default. Just thinking out loud. If you want, please create an issue describing the format change you made, and we can think about whether it can be customized to serve both.
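
Just to sketch what I have in mind (hypothetical names, nothing like this exists yet): a small formatter abstraction with the current layout as the default, so a Logstash/Kibana-style layout could be plugged in without a fork:

```scala
// Hypothetical sketch, not an existing DDQ API.
trait LogFormatter {
  def format(check: String, result: String): String
}

// Roughly the current plain-text style as the default
object DefaultFormatter extends LogFormatter {
  def format(check: String, result: String): String = s"$check: $result"
}

// A Logstash/Kibana-friendly JSON line as an alternative
object JsonLineFormatter extends LogFormatter {
  def format(check: String, result: String): String =
    s"""{"check":"$check","result":"$result"}"""
}
```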