hasPattern with Regex cannot match empty string

MalfoyJW commented 4 years ago

I have a dataset in which 30 rows of a column contain the empty string "" and the other 70 rows contain values matching [\d]+. The value of isComplete("columnName") is 1.0, so I believe this column has no nulls.

But when I run the code below in a Scala project, the empty strings are not counted as matches:

.hasPattern("columnName", """[\d]+|^$""".r)

The resulting value is 0.7, not 1.0. If I try the equivalent filter in a Jupyter notebook, however, it captures all of the rows:

df.filter($"columnName" rlike """[\d]+|^$""").count()

The count is 100.

Is there a special expression for matching empty strings in deequ?
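One possible explanation for the 0.7 versus 100 discrepancy, which is an assumption and not confirmed anywhere in this thread, is that the two code paths define a "match" differently: rlike only asks whether the regex finds a match, while a check that extracts the matched text and compares it with "" would treat an empty match as no match. The sketch below uses only the Scala standard library to illustrate that difference on data shaped like the report. It does not call deequ or Spark, and the names EmptyMatchDemo, fractionFound and fractionNonEmptyMatch are hypothetical.

```scala
import scala.util.matching.Regex

object EmptyMatchDemo {
  // Pattern from the report: one or more digits, or the empty string.
  val pattern: Regex = """[\d]+|^$""".r

  // 30 empty strings and 70 numeric strings, mirroring the reported data.
  val values: Seq[String] = Seq.fill(30)("") ++ (1 to 70).map(_.toString)

  // Semantics A: a row counts if the regex finds any match at all,
  // including a zero-length one (comparable to an rlike-style filter).
  def fractionFound(vs: Seq[String], p: Regex): Double =
    vs.count(v => p.findFirstIn(v).isDefined).toDouble / vs.size

  // Semantics B (assumed, for illustration): a row counts only if the
  // matched text is non-empty, so a zero-length match is discarded.
  def fractionNonEmptyMatch(vs: Seq[String], p: Regex): Double =
    vs.count(v => p.findFirstIn(v).exists(_.nonEmpty)).toDouble / vs.size

  def main(args: Array[String]): Unit = {
    println(s"any match:       ${fractionFound(values, pattern)}")
    println(s"non-empty match: ${fractionNonEmptyMatch(values, pattern)}")
  }
}
```

If deequ behaves like the second function, no regex that matches only the empty string can raise the score, which would explain why a placeholder value is needed instead.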

lcgcastro96 commented 3 years ago

Were you able to find a solution to your problem? Unfortunately, I'm running into the same issue.

ramonpineda81 commented 3 years ago

A workaround is to replace empty strings with a placeholder string and then apply hasPattern to the updated column, for example:

inputDF.withColumn("NEW_COL", regexp_replace(col("COL_NAME"), "^$", "N/A"))

The pattern passed to hasPattern then has to accept the placeholder as well.
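For anyone who wants to try the workaround end to end, here is a minimal sketch. It assumes Spark and deequ are on the classpath and that a local SparkSession is acceptable. The sample data mirrors the original report, the column names come from the comment above, and the pattern [\d]+|N/A is an assumption that extends the original pattern to the placeholder. The object and method names (EmptyStringWorkaround, sampleData, check) are hypothetical.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object EmptyStringWorkaround {
  // 30 empty strings and 70 numeric strings, as in the original report.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    (Seq.fill(30)("") ++ (1 to 70).map(_.toString)).toDF("COL_NAME")
  }

  // Replace empty strings with a placeholder, then run hasPattern on the
  // new column with a pattern that also accepts the placeholder.
  def check(inputDF: DataFrame): CheckStatus.Value = {
    val patchedDF = inputDF.withColumn(
      "NEW_COL", regexp_replace(col("COL_NAME"), "^$", "N/A"))

    VerificationSuite()
      .onData(patchedDF)
      .addCheck(
        Check(CheckLevel.Error, "COL_NAME is numeric or empty")
          // Uses the default assertion, which requires every row to match.
          .hasPattern("NEW_COL", """[\d]+|N/A""".r))
      .run()
      .status
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-string-workaround")
      .getOrCreate()
    println(s"Check status: ${check(sampleData(spark))}")
    spark.stop()
  }
}
```

The placeholder should be a value that cannot occur in the real data; otherwise genuine "N/A" entries would become indistinguishable from empty strings.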

awslabs / deequ

hasPattern with Regex cannot match empty string #243