FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0
222 stars 69 forks

isConvertibleToDate should support empty values #29

Open ghost opened 9 years ago

ghost commented 9 years ago

Method isConvertibleToDate currently doesn't support empty values (strings containing only spaces). If you use it to check a column for a date format and the column contains some empty values, this method will always report errors.

An additional parameter allowEmptyValues: Boolean = false would fix this.
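A minimal sketch of the proposed semantics (DDQ itself is Scala; this is a language-neutral illustration, not the library's actual API, and the function name is hypothetical): a whitespace-only string counts as convertible only when the flag is set, anything else must actually parse as a date.

```python
from datetime import datetime

def is_convertible_to_date(value, date_format, allow_empty_values=False):
    """Per-value convertibility test with the proposed allow_empty_values
    flag: whitespace-only strings pass only when the flag is set."""
    if value is None:
        return False
    if value.strip() == "":
        return allow_empty_values
    try:
        datetime.strptime(value.strip(), date_format)
        return True
    except ValueError:
        return False
```

With the default of false the check keeps its current strict behaviour, so existing callers would be unaffected.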

FRosner commented 9 years ago

@mfsny what is the use case for allowing empty values? If you want to check if a column is convertible to a date then it should return the exact results that you would get when actually converting it. Don't you think?

ghost commented 9 years ago

Well, I have a table with a column that I want to check for a date format, but this column can be empty (spaces, not null). At the moment my check fails because the spaces cannot be converted to the date format.

FRosner commented 9 years ago

Of course, but I was wondering whether this is a valid case for a general check. The same could also happen with booleans, numerics, etc. The question is whether we should add this, because to me it sounds like a special case.

In the end you need to understand what you want to do with the result of this check. What would you want to do with a column that can contain dates or empty strings? If you convert it to date it will throw exceptions for the empty strings.

If you want it to be null, then you should convert the empty strings to null first and then do the check, no?
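The pre-processing step suggested here can be sketched as follows (an illustrative helper, not part of DDQ; shown in Python rather than Scala for brevity): map whitespace-only strings to null so that a subsequent convertibility check only sees real values.

```python
def blank_to_null(values):
    """Map whitespace-only strings to None (null), leaving real values
    and existing nulls untouched."""
    return [None if v is not None and v.strip() == "" else v
            for v in values]
```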

ghost commented 9 years ago

That's right, this is not date-specific, the same applies to all other data types.

How would you write a check that accepts both empty values and date-formatted strings? Would you go with satisfies("trim(col1) = '' or col1 is in date format")?

FRosner commented 9 years ago
Check(myDf.filter(myDf("myStringDateColumn") !== "")).isConvertibleToDate("myStringDateColumn").run

Could be an option, although it is ugly, because it changes the whole data frame and not only the column. Allowing such exceptions applies not only to all data types but also to all possible kinds of placeholder values. You might want to check whether a column is either convertible to Date or equal to "NO_DATE", which you would later convert to null.

So in my opinion we should not extend the convertible check functions but rather think of a way to make them more flexible, combinable and chainable. This could be done by passing a function or something to transform the column before doing the check. But I don't know a convenient way, yet. What do you think, @mfsny?
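The "pass a function to transform the column before the check" idea could look roughly like this (a sketch under stated assumptions: the function name and signature are hypothetical, not DDQ's API, and the logic is shown in Python rather than Scala). The transform is applied per value first, and values it maps to null are skipped, so callers can map "" or "NO_DATE" to null themselves before the convertibility test runs.

```python
from datetime import datetime

def check_convertible_to_date(values, date_format, transform=lambda v: v):
    """Return the values that fail to parse as dates, after applying
    `transform` to each value; values transformed to None are skipped."""
    failures = []
    for v in values:
        v = transform(v)
        if v is None:
            continue  # caller chose to exclude this value from the check
        try:
            datetime.strptime(v, date_format)
        except (ValueError, TypeError):
            failures.append(v)
    return failures
```

For example, the "NO_DATE" placeholder case from above would become a one-line transform: check_convertible_to_date(col, "%Y-%m-%d", transform=lambda v: None if v in ("", "NO_DATE") else v).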