awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

hasPattern with Regex cannot match empty string #243

Open MalfoyJW opened 4 years ago

MalfoyJW commented 4 years ago

I have a data set that consists of 30 rows containing the empty string "" and 70 rows matching the format [\d]+. The value of isComplete("columnName") is 1.0, so I think this column has no nulls.

But when I run the code below in a Scala project, it does not count empty strings like "":
.hasPattern("columnName", """[\d]+|^$""".r) => the value is 0.7, not 1.0
But if I try the same pattern in a Jupyter notebook, it captures all of the rows:
df.filter($"columnName" rlike "[\d]+|^$").count() => the value is 100

Is there a special expression for matching the empty string in Deequ?

lcgcastro96 commented 3 years ago

Were you able to figure out a solution to your problem? Unfortunately, I'm running into the same issue.

ramonpineda81 commented 3 years ago

A workaround is to replace empty strings with a special placeholder string and then apply hasPattern to the updated column, e.g. inputDF.withColumn("NEW_COL", regexp_replace(col("COL_NAME"), "^$", "N/A")). The pattern passed to hasPattern then has to accept the placeholder as well.
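For anyone who wants to try the placeholder approach end to end, here is a minimal sketch. It assumes Spark and Deequ are on the classpath. The object name, the helper `withSentinel`, the sample data, and the placeholder value "N/A" are made up for illustration. It uses `when(...).otherwise(...)` instead of `regexp_replace` to swap in the placeholder; that is a substitution on my part, chosen so the step does not depend on how a given Spark version applies a regex replacement to empty input. The placeholder must be a value that cannot occur in the real data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

object EmptyStringPatternWorkaround {
  // Hypothetical placeholder; pick a value that never appears in the real data.
  val Sentinel = "N/A"

  // Copies `source` into `target`, replacing empty strings with the placeholder.
  // Nulls stay null, so completeness checks on the new column are unaffected.
  def withSentinel(df: DataFrame, source: String, target: String): DataFrame =
    df.withColumn(
      target,
      when(col(source) === "", lit(Sentinel)).otherwise(col(source)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("deequ-empty-string-workaround")
      .getOrCreate()
    import spark.implicits._

    // Sample data shaped like the report: 30 empty strings and 70 digit strings.
    val input = (Seq.fill(30)("") ++ (1 to 70).map(_.toString)).toDF("COL_NAME")
    val patched = withSentinel(input, "COL_NAME", "NEW_COL")

    // The pattern accepts either digits or the placeholder.
    val check = Check(CheckLevel.Error, "pattern check with placeholder")
      .hasPattern("NEW_COL", """[\d]+|N/A""".r)

    val result = VerificationSuite().onData(patched).addCheck(check).run()
    println(result.status)

    spark.stop()
  }
}
```

The original column is left untouched, so other checks such as isComplete("COL_NAME") can still run against it in the same verification.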