Closed FRosner closed 8 years ago
We have two options here.
Do you have an opinion on that one @Gerrrr?
I think, that we should catch exceptions in any case somewhere around here and produce ConstraintException
(or ConstraintError
) with an exception string as a message. This way you have specific messages for different exceptions and a stack trace as well.
Local try-catches for each constraint are nice and should be there, but having global try-catch, as you said, will provide us with a guarantee that the program will not crash in case of a missed local try-check.
Thanks for your opinion @Gerrrr. This is exactly the place where I put it right now.
Let me give an example: I put a constraint that checks whether column a is not null. However, there is no column a because something went wrong with the CSV file. If I put the check there it will probably give us an exception message that looks something like "Could not resolve column a.". This is ok and works fine.
Only problem I am having is how to implement this. Right now I have a constraint and a specific result for this constraint. This ConstraintError
would be an exception to this pattern because it is a special result that can occur for all constraints. So if I do this it would need to implement the ConstraintResult
interface. That means it needs to have a constraint (the constraint that threw an exception :+1:), a message (the message of the exception :+1:) and a status (what status? :-1:). What status shall we use? If I introduce a new status, all other constraint results have to also deal with it, although they will never see it. If I use Failure, the serialization to JSON will have to make an exception to map that accordingly. It's difficult. If you want we can discuss in a screen sharing session tonight or tomorrow night :) Let me know.
You can add de.frosner.ddq.constraints.ConstraintError
which extends de.frosner.ddq.constraints.ConstraintStatus
. This way ConstraintResult
can have ConstraintError
status, returned either by Constraint.fun
itself (you caught an exception inside constraint and returned the appropriate status) or in the runner if you caught an exception there.
What do you think?
Ah so you are suggesting that we implement both? If Constraint.fun
does not throw an exception it can still return a ConstraintError
state? So we have a ErroredConstraintResult
if something goes wrong in the runner, that always has ConstraintError
as a status but then also each constraint result can have a ConstraintError
state individually?
But if you look at ConditionalColumnConstraintResult
for example, it also has a number of failing rows as an argument. This would not be valued if the constraint errored. But I don't want to make everything optional everywhere.
What about we move the constraint status out of the result so that we can have a status without a result? A result will then simply keep fields that are there in case there was no error? So a CheckResult
would not have a Map[Constraint, ConstraintResult[Constraint]]
but a Map[Constraint, (ConstraintStatus, Option[ConstraintResult[Constraint]]]
? I think I need to sleep a bit to get my head around this. Thanks again for your input. Feel free to comment again @Gerrrr :)
I thought a bit more about this @Gerrrr and I think that the two cases (general and constraint specific error) differ significantly.
If there is an error inside the constraint function that is detected by the constraint, it is some kind of constraint result. It means that the constraint was expecting that something can go wrong based on the data and reflects this in the error state.
If there is an error inside the constraint function that is not handled by it, it could be caught by the runner. However, one should at least be able to decide whether the runner should catch this error, because it might mean that there is something seriously wrong (OOM, etc.). And we might not want to swallow exceptions not related to data quality (missing column etc.) and mix them with exceptions related to it in the dashboards.
What do you think?
I don't like to copy and paste the try code everywhere. I will think of a better solution.
Description
If a constraint cannot be checked, because there is an exception (e.g. column not existing), it should not crash the program but just let the constraint enter an error state.
Open Points
[warn] /Users/frosner/Documents/projects/drunken-data-quality/src/main/scala/de/frosner/ddq/constraints/NeverNullConstraintResult.scala:11: match may not be exhaustive.
[warn] It would fail on the following inputs: (??, None, None, Some(_)), (??, Option(), None, Some(_)), (ConstraintError(_), None, None, Some(_)), (ConstraintError(_), None, Some(_), None), (ConstraintError(_), None, Some(_), Some(_)), (ConstraintError(_), Option(), None, Some(_)), (ConstraintError(_), Option(), Option(), Some(_)), (ConstraintError(_), Option(), Some(_), None), (ConstraintError(_), Some(_), None, Some(_)), (ConstraintError(_), Some(_), Some(_), None), (ConstraintFailure, None, None, None), (ConstraintFailure, None, None, Some(_)), (ConstraintFailure, Option(), None, None), (ConstraintFailure, Option(), None, Some(_)), (ConstraintSuccess, None, None, Some(_)), (ConstraintSuccess, None, Some(_), Some(_))