FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0
222 stars 69 forks source link

Exception handling and constraint error state #73

Closed FRosner closed 8 years ago

FRosner commented 8 years ago

Description

If a constraint cannot be checked, because there is an exception (e.g. column not existing), it should not crash the program but just let the constraint enter an error state.

Open Points

FRosner commented 8 years ago

We have two options here.

  1. Implement a global try-catch on the runner level that will result in a generic error state in case a constraint check throws an exception. This makes it impossible for a constraint to crash the program even if the developer missed an exception. But the result will be very generic.
  2. Implement local try-catches for each constraint on the function level. This allows specific messages for different exceptions. It gives the constraint developer the flexibility but also responsibility to handle exceptions.

Do you have an opinion on that one @Gerrrr?

Gerrrr commented 8 years ago

I think, that we should catch exceptions in any case somewhere around here and produce ConstraintException (or ConstraintError) with an exception string as a message. This way you have specific messages for different exceptions and a stack trace as well.

Local try-catches for each constraint are nice and should be there, but having global try-catch, as you said, will provide us with a guarantee that the program will not crash in case of a missed local try-check.

FRosner commented 8 years ago

Thanks for your opinion @Gerrrr. This is exactly the place where I put it right now.

Let me give an example: I put a constraint that checks whether column a is not null. However, there is no column a because something went wrong with the CSV file. If I put the check there it will probably give us an exception message that looks something like "Could not resolve column a.". This is ok and works fine.

Only problem I am having is how to implement this. Right now I have a constraint and a specific result for this constraint. This ConstraintError would be an exception to this pattern because it is a special result that can occur for all constraints. So if I do this it would need to implement the ConstraintResult interface. That means it needs to have a constraint (the constraint that threw an exception :+1:), a message (the message of the exception :+1:) and a status (what status? :-1:). What status shall we use? If I introduce a new status, all other constraint results have to also deal with it, although they will never see it. If I use Failure, the serialization to JSON will have to make an exception to map that accordingly. It's difficult. If you want we can discuss in a screen sharing session tonight or tomorrow night :) Let me know.

Gerrrr commented 8 years ago

You can add de.frosner.ddq.constraints.ConstraintError which extends de.frosner.ddq.constraints.ConstraintStatus. This way ConstraintResult can have ConstraintErrorstatus, returned either by Constraint.fun itself (you caught an exception inside constraint and returned the appropriate status) or in the runner if you caught an exception there. What do you think?

FRosner commented 8 years ago

Ah so you are suggesting that we implement both? If Constraint.fun does not throw an exception it can still return a ConstraintError state? So we have a ErroredConstraintResult if something goes wrong in the runner, that always has ConstraintError as a status but then also each constraint result can have a ConstraintError state individually?

But if you look at ConditionalColumnConstraintResult for example, it also has a number of failing rows as an argument. This would not be valued if the constraint errored. But I don't want to make everything optional everywhere.

What about we move the constraint status out of the result so that we can have a status without a result? A result will then simply keep fields that are there in case there was no error? So a CheckResult would not have a Map[Constraint, ConstraintResult[Constraint]]but a Map[Constraint, (ConstraintStatus, Option[ConstraintResult[Constraint]]]? I think I need to sleep a bit to get my head around this. Thanks again for your input. Feel free to comment again @Gerrrr :)

FRosner commented 8 years ago

I thought a bit more about this @Gerrrr and I think that the two cases (general and constraint specific error) differ significantly.

If there is an error inside the constraint function that is detected by the constraint, it is some kind of constraint result. It means that the constraint was expecting that something can go wrong based on the data and reflects this in the error state.

If there is an error inside the constraint function that is not handled by it, it could be caught by the runner. However, one should at least be able to decide whether the runner should catch this error, because it might mean that there is something seriously wrong (OOM, etc.). And we might not want to swallow exceptions not related to data quality (missing column etc.) and mix them with exceptions related to it in the dashboards.

What do you think?

FRosner commented 8 years ago

image

FRosner commented 8 years ago

I don't like to copy and paste the try code everywhere. I will think of a better solution.