awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Profiler fails with KLL sketch enabled for null columns #12

Closed MOHACGCG closed 3 years ago

MOHACGCG commented 3 years ago

Describe the bug: When KLL is enabled for profiling to calculate percentiles and a column contains only null values (completeness = 0), converting the percentiles from the Java list [''] to a Python list fails in the java_list_to_python_list function in scala_utils.

To Reproduce Steps to reproduce the behavior:

  1. Create a DataFrame with column_1 defined as a numeric type.
  2. Add only null values to column_1.
  3. Calculate the profile with KLL enabled.

Failure:

pydeequ/scala_utils.py", line 101, in <listcomp>
    vals = [datatype(i) for i in java_list[start+1:end].split(',')]
ValueError: could not convert string to float:
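The failure can be reproduced without Spark. The sketch below assumes the Java list arrives as a bracketed string such as `[]`, which is inferred from the `java_list[start+1:end].split(',')` expression in the traceback. The helper name `naive_parse` is hypothetical and is not the pydeequ function itself.

```python
def naive_parse(java_list: str, datatype):
    # Hypothetical stand-in that mirrors the expression shown in the traceback.
    start = java_list.index("[")
    end = java_list.rindex("]")
    return [datatype(i) for i in java_list[start + 1:end].split(",")]


# Splitting an empty string yields [''], not an empty list,
# so the list comprehension still calls datatype('') once.
assert "".split(",") == [""]

try:
    naive_parse("[]", float)
except ValueError as exc:
    # float('') raises ValueError, which matches the reported failure.
    error_message = str(exc)
```

The key point is that an all-null column appears to produce an empty list body, and `float('')` cannot be evaluated.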

Expected behavior: The profile should be calculated, and the percentiles should be None.

Proposed change: Handle empty values as None, except for string values. https://github.com/awslabs/python-deequ/pull/11
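One way the proposed behavior might look is sketched below. This is an illustration only, not the code from the linked pull request: the function name `parse_java_list` and its signature are hypothetical, and the bracketed-string input format is an assumption based on the traceback.

```python
from typing import Callable, List, Optional


def parse_java_list(java_list: str, datatype: Callable = str) -> List[Optional[object]]:
    # Hypothetical helper, not the actual pydeequ implementation.
    start = java_list.index("[")
    end = java_list.rindex("]")
    vals = []
    for item in java_list[start + 1:end].split(","):
        if datatype is not str and item.strip() == "":
            # Empty non-string value (e.g. from an all-null column) becomes None.
            vals.append(None)
        else:
            # Strings are passed through unchanged, including empty strings.
            vals.append(datatype(item))
    return vals


percentiles = parse_java_list("[]", float)
```

With this sketch, an empty numeric list yields `[None]` instead of raising, while string values keep the empty string.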

MOHACGCG commented 3 years ago

Code to reproduce the issue:

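# Assumes an active SparkSession named `spark` with the Deequ jar on its classpath.
from pyspark.sql.types import LongType, StructField, StructType
from pydeequ.profiles import ColumnProfilerRunner
schema = StructType().add(StructField("numeric_column", LongType()))
df = spark.createDataFrame([{"numeric_column": None}], schema=schema)
ColumnProfilerRunner(spark).onData(df).withKLLProfiling().run()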