FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0
222 stars 69 forks

WIP: Prototype for #95 #99

Open FRosner opened 8 years ago

FRosner commented 8 years ago

@DivyaGe here's a proposal for how it could look. The API for getting the violating data back is quite complex at the moment, but I don't really know how to make it simpler. Any suggestions?

Proposal for #95

grantgordon commented 7 years ago

My team and I love this project and we'd like to build our data quality functionality on top of it. This particular feature would be very useful for us. We're happy to roll up our sleeves and contribute as well, of course. What would be the best way for us to help here?

FRosner commented 7 years ago

Hi @grantgordon! I'm happy to read that you like DDQ :)

As you can see, this PR has been open for a while, and since the original "requester" never came back, I didn't pursue it much further.

I think the main concept behind this change is to add not only aggregated information to the constraint result (e.g. the number of rows that are not null) but also a reference to a dataframe containing the violating data. However, I don't think we can generalize this, as not all constraints are row based, so we'd have to add this result on a per-constraint basis.

Which constraints are you mainly interested in? If you want to discuss more interactively, feel free to join the chat on Gitter and tell me more about your specific use case and the available options to address it. We can discuss either in the lobby or in a private chat :)

grantgordon commented 7 years ago

Thanks for the quick response!

You're absolutely right. I've signed into the gitter room and I'll try to check back throughout the day, but here's a quick summary of our situation and what we're looking to do:

Here are a few example columns and the thresholds we'd need to enforce. Fill rate = 1 − (% of null values in a column). Falling below the minimum threshold would block the data from processing; a value between the minimum and the target would generate a warning:

| Column Name | Min Fill Rate | Target Fill Rate | Min Valid Rate | Target Valid Rate |
|---|---|---|---|---|
| Diagnosis Code | 97% | 99% | 99% | 99.5% |
| Procedure Code | 80% | 85% | 99% | 99.5% |
| Provider ID | 95% | 99% | 98% | 99% |
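To make the threshold semantics concrete, here is a minimal, Spark-free sketch of how a fill-rate check could classify a column against a min/target pair. All names here (`QualityStatus`, `classify`, the sample data) are hypothetical and not part of DDQ:

```scala
// Hypothetical three-way outcome: below min blocks processing,
// between min and target warns, at or above target passes.
sealed trait QualityStatus
case object Block extends QualityStatus
case object Warn extends QualityStatus
case object Pass extends QualityStatus

// Fill rate = fraction of non-null values in a column.
def fillRate(values: Seq[Option[String]]): Double =
  if (values.isEmpty) 1.0
  else values.count(_.isDefined).toDouble / values.size

def classify(rate: Double, min: Double, target: Double): QualityStatus =
  if (rate < min) Block
  else if (rate < target) Warn
  else Pass

// Diagnosis Code thresholds from the table: min 97%, target 99%.
// 2 nulls in 100 rows -> 98% fill rate -> warning, not a block.
val diagnosisCodes: Seq[Option[String]] =
  Seq.fill(98)(Some("E11.9")) ++ Seq.fill(2)(None)
val status = classify(fillRate(diagnosisCodes), min = 0.97, target = 0.99)
```

In a real integration the fill rate would come from an aggregation over the dataframe rather than an in-memory sequence.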

Here's an example table threshold:

| Row constraint | Min Valid Row Rate | Target Valid Row Rate |
|---|---|---|
| `col_x = col_y + col_z` | 97% | 99% |

At the same time, massaging the data so that we can pass these kinds of thresholds is fairly manual, unfortunately, so being able to hand failed values back to the user for quick profiling would be most helpful.

In our estimation, if we can just get the reference to the dataframe containing the violating data, then we can accomplish all of this.

I believe this would be valuable to all potential users of DDQ. We're also happy to roll up our sleeves and help.

Does that help clarify? Happy to talk more in gitter, or however you'd like.

FRosner commented 7 years ago

Hi @grantgordon,

Gotcha. I see two issues you're describing:

1. Access to the data that violates a constraint (it's clear what that means for row-wise constraints; otherwise we'd have to look at it case by case)
2. Being more flexible in defining the "quality thresholds", i.e. letting constraints fail based on a configurable amount of quality problems

I think 1. can be solved on a constraint basis as proposed in this PR for the null check.

Number 2 will be very useful and might even render number 1 unnecessary, if I understand your use case correctly. It could be done by having fuzzy checks that don't fail on a boolean condition but rather succeed up to a given percentage, or by having custom success criteria (a function that transforms the constraint result data into a constraint status, i.e. success or failure).
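The custom-success-criterion idea could be sketched roughly like this in plain Scala. All names are hypothetical; a real implementation would plug into DDQ's constraint result types:

```scala
// Hypothetical aggregated result of a null check on one column.
case class NullCheckData(totalRows: Long, nullRows: Long)

sealed trait ConstraintStatus
case object ConstraintSuccess extends ConstraintStatus
case object ConstraintFailure extends ConstraintStatus

// A "fuzzy" criterion: succeed if at least minFillRate of the rows are non-null,
// instead of requiring zero nulls.
def fuzzyNullCriterion(minFillRate: Double): NullCheckData => ConstraintStatus =
  data => {
    val fill = 1.0 - data.nullRows.toDouble / data.totalRows
    if (fill >= minFillRate) ConstraintSuccess else ConstraintFailure
  }

val criterion = fuzzyNullCriterion(0.97)
val resultOk  = criterion(NullCheckData(totalRows = 1000, nullRows = 20)) // 98% filled
val resultBad = criterion(NullCheckData(totalRows = 1000, nullRows = 50)) // 95% filled
```

The criterion is just a function value, so users could pass their own in place of the default all-or-nothing behaviour.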

I'm just thinking that instead of using DDQ only for the initial checks and then writing your own code for percentage-based success checks immediately after, it might make sense to include this functionality in DDQ as well.

Are you using Python or Scala?

Do you have any time constraints? I just like to discuss new features thoroughly before adding them but if you're in a hurry we can also prototype something in your fork and merge it back when it's more mature :)

grantgordon commented 7 years ago

Yes! I believe that's correct. @log0ymxm does that seem correct to you?

We're using Scala.

I've been thinking about the logic for this similarly. Basically you need to run the lowest-level checks, then you need to run interpretive checks either on summaries of those checks or directly on the results. Two examples:

Example 1

  1. Run regex validity checks on all values in a column (validate)
  2. Calculate the % of invalid values (summarize)
  3. Check to see if % is above acceptable threshold (interpret)

Example 2

  1. Calculate % of dollars on claims that were out of network (summarize)
  2. Check to see if % is above acceptable threshold (interpret)
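The validate → summarize → interpret split from Example 1 could be sketched, Spark aside, with plain Scala collections. The regex and all names here are illustrative only:

```scala
// validate: per-value boolean flags (assumed ICD-like code format, e.g. "E11.9").
def validate(values: Seq[String]): Seq[Boolean] =
  values.map(_.matches("[A-Z][0-9]{2}(\\.[0-9]+)?"))

// summarize: fraction of values that failed validation.
def invalidRate(flags: Seq[Boolean]): Double =
  flags.count(!_).toDouble / flags.size

// interpret: compare the summary against an acceptable threshold.
def acceptable(rate: Double, maxInvalidRate: Double): Boolean =
  rate <= maxInvalidRate

val codes = Seq("E11.9", "I10", "bad-code", "J45.901")
val rate  = invalidRate(validate(codes))       // 1 of 4 invalid = 0.25
val ok    = acceptable(rate, maxInvalidRate = 0.30)
```

Each stage consumes only the previous stage's output, which is what makes the DAG view natural: the interpret step never needs to see the raw values again.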

I also think of it as a simple DAG.

We do have time constraints, but we could spend a week or so working on the design of this without a problem. What do you think is the best way to collaborate on this?

FRosner commented 7 years ago

Looks reasonable @grantgordon.

BTW: since 4.1.0 there are custom dataframe constraints (only available in Scala). You could try defining your own constraints and customizing them based on your thresholds. The basic idea behind custom constraints was that people can try out new ideas for constraints before they have to be added to the core library. Have you checked whether this works for you?

Regarding finding a well designed / integrated solution, I only have one concern / request / comment:

I like the simplicity of the DDQ API and I would like to keep it simple. So if we add additional functionality I think we should still maintain the current API that gives you quick results and just add this functionality on top. Kind of like a starter API and an expert API (like the return type you can use for unit testing).

Current API:

```scala
val check = Check(df)
  .hasPrimaryKey("column1")
  .run()
```

So if we are going to extend it, how should it look? I believe that if we want to change the way constraint results (success, failure) are calculated, that's not compatible with these simple method calls, is it?
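One hypothetical way to keep the starter API intact: give constraint methods optional threshold parameters whose defaults reproduce today's all-or-nothing behaviour, so existing calls compile and behave unchanged. This is a plain-Scala sketch under that assumption, not DDQ code:

```scala
// Minimal stand-in for per-column statistics (in DDQ this would come from the dataframe).
case class ColumnStats(total: Long, nonNull: Long)

case class Check(stats: Map[String, ColumnStats],
                 results: List[(String, Boolean)] = Nil) {

  // Default threshold of 1.0 keeps the strict semantics, so the existing
  // call style isNeverNull("column") needs no changes.
  def isNeverNull(column: String, minFillRate: Double = 1.0): Check = {
    val s  = stats(column)
    val ok = s.nonNull.toDouble / s.total >= minFillRate
    copy(results = results :+ (s"isNeverNull($column)" -> ok))
  }

  def run(): List[(String, Boolean)] = results
}

val check   = Check(Map("Diagnosis Code" -> ColumnStats(total = 100, nonNull = 98)))
val strict  = check.isNeverNull("Diagnosis Code").run()        // fails: 98% < 100%
val relaxed = check.isNeverNull("Diagnosis Code", 0.97).run()  // passes: 98% >= 97%
```

The fluent chain stays exactly as in the current API; only users who need fuzzy thresholds ever see the extra parameter.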

How about I walk you through the code a bit in a screen-sharing session and we brainstorm how we could design it in a flexible way? Then maybe you can propose a draft and we discuss it? You can write me on Twitter @FRosnerd or in a private Gitter chat and I can give you my Skype account name. Hangouts is also an option, of course.

grantgordon commented 7 years ago

@FRosner had to take a break from this but we're back with a vengeance. @levismiller is going to be working on this project for us. Are you still interested in collaborating?

FRosner commented 7 years ago

Sure thing. Just ping me tomorrow :)