FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0
222 stars 69 forks

WIP: Prototype for #95 #99

Open FRosner opened 8 years ago

FRosner commented 8 years ago

@DivyaGe here's a proposal for how it could look. The API for getting the violating data back is quite complex at the moment, but I don't really know how to make it simpler. Any suggestions?

Proposal for #95

grantgordon commented 7 years ago

My team and I love this project and we'd like to build our data quality functionality on top of it. This particular feature would be very useful for us. We're happy to roll up our sleeves and contribute as well, of course. What would be the best way for us to help here?

FRosner commented 7 years ago

Hi @grantgordon! I'm happy to read that you like DDQ :)

As you can see, this PR has been open for a while, and since the original "requester" never came back, I didn't pursue it much further.

I think the main concept behind this change is to add not only aggregated information to the constraint result (e.g. the number of rows that are not null) but also a reference to a dataframe containing the violating data. However, I don't think we can generalize this, as not all constraints are row based, so we'd have to add this result on a per-constraint basis.

Which constraints are you mainly interested in? If you want to discuss more interactively, feel free to join the chat on Gitter and tell me more about your specific use case and the available options to address it. We can discuss either in the lobby or in a private chat :)

grantgordon commented 7 years ago

Thanks for the quick response!

You're absolutely right. I've signed into the gitter room and I'll try to check back throughout the day, but here's a quick summary of our situation and what we're looking to do:

Here are a few example columns and the thresholds we'd need to enforce. Fill rate = 1 − (% of null values in a column). Falling below the minimum threshold would block the data from processing; a value between the minimum and the target would generate a warning:

| Column Name | Min Fill Rate | Target Fill Rate | Min Valid Rate | Target Valid Rate |
|---|---|---|---|---|
| Diagnosis Code | 97% | 99% | 99% | 99.5% |
| Procedure Code | 80% | 85% | 99% | 99.5% |
| Provider ID | 95% | 99% | 98% | 99% |
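To make the threshold semantics concrete, here is a minimal, Spark-free sketch of how a fill-rate check could classify a column against a min/target pair. All names here (`QualityStatus`, `classify`, the sample data) are hypothetical and not part of DDQ:

```scala
// Hypothetical three-way outcome: below min blocks processing,
// between min and target warns, at or above target passes.
sealed trait QualityStatus
case object Block extends QualityStatus
case object Warn extends QualityStatus
case object Pass extends QualityStatus

// Fill rate = fraction of non-null values in a column.
def fillRate(values: Seq[Option[String]]): Double =
  if (values.isEmpty) 1.0
  else values.count(_.isDefined).toDouble / values.size

def classify(rate: Double, min: Double, target: Double): QualityStatus =
  if (rate < min) Block
  else if (rate < target) Warn
  else Pass

// Diagnosis Code thresholds from the table: min 97%, target 99%.
// 2 nulls in 100 rows -> 98% fill rate -> warning, not a block.
val diagnosisCodes: Seq[Option[String]] =
  Seq.fill(98)(Some("E11.9")) ++ Seq.fill(2)(None)
val status = classify(fillRate(diagnosisCodes), min = 0.97, target = 0.99)
```

In a real integration the fill rate would come from an aggregation over the dataframe rather than an in-memory sequence.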

Here's an example table threshold:

| Row constraint | Min Valid Row Rate | Target Valid Row Rate |
|---|---|---|
| `col_x = col_y + col_z` | 97% | 99% |

At the same time, massaging the data so that we can pass these kinds of thresholds is fairly manual, unfortunately, so being able to hand failed values back to the user for quick profiling would be most helpful.

In our estimation, if we can just get the reference to the dataframe containing the violating data, then we can accomplish all of this.

I believe this would be valuable to all potential users of DDQ. We're also happy to roll up our sleeves and help.

Does that help clarify? Happy to talk more in gitter, or however you'd like.

FRosner commented 7 years ago

Hi @grantgordon,

Gotcha. I see two issues you're describing:

1. Access to the data that violates a constraint (it's clear what that means for row-wise constraints; otherwise we'd have to look at it case by case)
2. Being more flexible in defining the "quality thresholds", i.e. letting constraints fail based on a configurable amount of quality problems

I think 1. can be solved on a constraint basis as proposed in this PR for the null check.

Number 2 will be very useful and might even render number 1 unnecessary, if I understand your use case correctly. It could be done by having fuzzy checks that don't fail on a boolean condition but rather succeed up to a given percentage, or by having custom success criteria (a function that transforms the constraint result data into a constraint status, i.e. success or failure).
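The custom-success-criterion idea could be sketched roughly like this in plain Scala. All names are hypothetical; a real implementation would plug into DDQ's constraint result types:

```scala
// Hypothetical aggregated result of a null check on one column.
case class NullCheckData(totalRows: Long, nullRows: Long)

sealed trait ConstraintStatus
case object ConstraintSuccess extends ConstraintStatus
case object ConstraintFailure extends ConstraintStatus

// A "fuzzy" criterion: succeed if at least minFillRate of the rows are non-null,
// instead of requiring zero nulls.
def fuzzyNullCriterion(minFillRate: Double): NullCheckData => ConstraintStatus =
  data => {
    val fill = 1.0 - data.nullRows.toDouble / data.totalRows
    if (fill >= minFillRate) ConstraintSuccess else ConstraintFailure
  }

val criterion = fuzzyNullCriterion(0.97)
val resultOk  = criterion(NullCheckData(totalRows = 1000, nullRows = 20)) // 98% filled
val resultBad = criterion(NullCheckData(totalRows = 1000, nullRows = 50)) // 95% filled
```

The criterion is just a function value, so users could pass their own in place of the default all-or-nothing behaviour.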

I'm just thinking that instead of using DDQ only for the initial checks and then writing your own code for percentage-based success checks immediately after, it might make sense to include this functionality in DDQ as well.

Are you using Python or Scala?

Do you have any time constraints? I just like to discuss new features thoroughly before adding them but if you're in a hurry we can also prototype something in your fork and merge it back when it's more mature :)

grantgordon commented 7 years ago

Yes! I believe that's correct. @log0ymxm does that seem correct to you?

We're using Scala.

I've been thinking about the logic for this similarly. Basically you need to run the lowest-level checks, then you need to run interpretive checks either on summaries of those checks or directly on the results. Two examples:

Example 1

  1. Run regex validity checks on all values in a column (validate)
  2. Calculate the % of invalid values (summarize)
  3. Check to see if % is above acceptable threshold (interpret)

Example 2

  1. Calculate % of dollars on claims that were out of network (summarize)
  2. Check to see if % is above acceptable threshold (interpret)
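The validate → summarize → interpret split from Example 1 could be sketched, Spark aside, with plain Scala collections. The regex and all names here are illustrative only:

```scala
// validate: per-value boolean flags (assumed ICD-like code format, e.g. "E11.9").
def validate(values: Seq[String]): Seq[Boolean] =
  values.map(_.matches("[A-Z][0-9]{2}(\\.[0-9]+)?"))

// summarize: fraction of values that failed validation.
def invalidRate(flags: Seq[Boolean]): Double =
  flags.count(!_).toDouble / flags.size

// interpret: compare the summary against an acceptable threshold.
def acceptable(rate: Double, maxInvalidRate: Double): Boolean =
  rate <= maxInvalidRate

val codes = Seq("E11.9", "I10", "bad-code", "J45.901")
val rate  = invalidRate(validate(codes))       // 1 of 4 invalid = 0.25
val ok    = acceptable(rate, maxInvalidRate = 0.30)
```

Each stage consumes only the previous stage's output, which is what makes the DAG view natural: the interpret step never needs to see the raw values again.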

I also think of it as a simple DAG.

We do have time constraints, but we could spend a week or so working on the design of this without a problem. What do you think is the best way to collaborate on this?

FRosner commented 7 years ago

Looks reasonable @grantgordon.

BTW: since 4.1.0 there are custom dataframe constraints (only available in Scala). You could try defining your own constraints and customizing them based on your thresholds. The basic idea behind custom constraints was that people can try out new ideas for constraints before they have to be added to the core library. Have you checked whether this works for you?

Regarding finding a well designed / integrated solution, I only have one concern / request / comment:

I like the simplicity of the DDQ API and I would like to keep it simple. So if we add additional functionality I think we should still maintain the current API that gives you quick results and just add this functionality on top. Kind of like a starter API and an expert API (like the return type you can use for unit testing).

Current API:

```scala
val check = Check(df)
  .hasPrimaryKey("column1")
  .run()
```

So if we are going to extend it, how should it look? I believe that if we want to change the way constraint results (success, failure) are calculated, that's not compatible with these simple method calls, is it?
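One hypothetical way to keep the starter API intact: give constraint methods optional threshold parameters whose defaults reproduce today's all-or-nothing behaviour, so existing calls compile and behave unchanged. This is a plain-Scala sketch under that assumption, not DDQ code:

```scala
// Minimal stand-in for per-column statistics (in DDQ this would come from the dataframe).
case class ColumnStats(total: Long, nonNull: Long)

case class Check(stats: Map[String, ColumnStats],
                 results: List[(String, Boolean)] = Nil) {

  // Default threshold of 1.0 keeps the strict semantics, so the existing
  // call style isNeverNull("column") needs no changes.
  def isNeverNull(column: String, minFillRate: Double = 1.0): Check = {
    val s  = stats(column)
    val ok = s.nonNull.toDouble / s.total >= minFillRate
    copy(results = results :+ (s"isNeverNull($column)" -> ok))
  }

  def run(): List[(String, Boolean)] = results
}

val check   = Check(Map("Diagnosis Code" -> ColumnStats(total = 100, nonNull = 98)))
val strict  = check.isNeverNull("Diagnosis Code").run()        // fails: 98% < 100%
val relaxed = check.isNeverNull("Diagnosis Code", 0.97).run()  // passes: 98% >= 97%
```

The fluent chain stays exactly as in the current API; only users who need fuzzy thresholds ever see the extra parameter.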

How about I walk you through the code a bit in a screen-sharing session and we brainstorm how we could design it in a flexible way? Then maybe you can propose a draft and we discuss it? You can write me on Twitter @FRosnerd or in a private Gitter chat and I can give you my Skype account name. Hangouts is also an option, of course.

grantgordon commented 7 years ago

@FRosner had to take a break from this but we're back with a vengeance. @levismiller is going to be working on this project for us. Are you still interested in collaborating?

FRosner commented 7 years ago

Sure thing. Just ping me tomorrow :)