awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0
3.31k stars, 539 forks

Pydeequ isfractional is not working as expected. #352

Closed prashantprp closed 3 years ago

prashantprp commented 3 years ago


I am using supermarket_sales.csv:

    import pydeequ
    from pydeequ.analyzers import *
    df = spark.read.option("header", "true").csv("/FileStore/tables/supermarket_sales.csv")

The check .hasDataType("Rating", ConstrainableDataTypes.Fractional) fails with the message "Value: 0.881 does not meet the constraint requirement". This is a small 1000-row dataset and there is no 0.881 value in the Rating column, so where does deequ pull this number from?

Attachment: supermarket_sales.zip

mschandra18 commented 3 years ago

The output means that about 88% of the rows contain data of the Fractional type. Because the constraint requires a 100% match by default, it fails. 0.881 is the ratio of rows that satisfy the constraint to the total number of rows; it is not a value taken from the Rating column.
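To illustrate where a ratio like this comes from, here is a minimal plain-Python sketch. It is not Deequ's implementation: the regular expressions, the helper names `classify` and `type_ratios`, and the sample ratings are all assumptions made for illustration. The idea is that each string value is classified by its shape, and the reported number is the share of rows in the requested class.

```python
import re
from collections import Counter

# Illustrative patterns only (assumed, not copied from Deequ).
INTEGRAL = re.compile(r"^[-+]?\d+$")
FRACTIONAL = re.compile(r"^[-+]?\d*\.\d+$")


def classify(value):
    """Classify one raw string value by its textual shape."""
    if value is None:
        return "Unknown"
    if INTEGRAL.match(value):
        return "Integral"
    if FRACTIONAL.match(value):
        return "Fractional"
    return "String"


def type_ratios(values):
    """Return the share of rows that fall into each inferred type."""
    counts = Counter(classify(v) for v in values)
    total = len(values)
    return {name: count / total for name, count in counts.items()}


# Hypothetical ratings as they would arrive from a CSV read without
# schema inference: every value is a string.
ratings = ["9.1", "7.4", "8", "5.3", "10", "6.0", "4.1", "7"]
ratios = type_ratios(ratings)

# A Fractional-only check passes only if this share is exactly 1.0.
fractional_share = ratios.get("Fractional", 0.0)
# A Numeric check counts Integral and Fractional rows together.
numeric_share = fractional_share + ratios.get("Integral", 0.0)
print(ratios, numeric_share)
```

In this made-up sample, whole-number strings such as "8" and "10" are counted as Integral, which pulls the Fractional share below 1.0 even though every value is a valid number.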

aviatesk commented 3 years ago

Use .hasDataType(column, ConstrainableDataTypes.Numeric) if you want to allow both Integral and Fractional values in the column.

FWIW, my fork of deequ will suggest a constraint with Numeric if your column contains both types: https://github.com/aviatesk/deequ/pull/2
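The Numeric suggestion could look roughly like this in PyDeequ. This is a sketch under assumptions: the function name `check_rating_is_numeric` and the check description are made up, and it expects an existing SparkSession `spark` configured with the Deequ jar plus the DataFrame `df` from the original post. The imports are deferred into the function so the snippet can be loaded without a Spark environment.

```python
def check_rating_is_numeric(spark, df):
    """Run a hasDataType check that accepts Integral and Fractional values."""
    # Deferred imports: pydeequ needs a working Spark setup to import.
    from pydeequ.checks import Check, CheckLevel, ConstrainableDataTypes
    from pydeequ.verification import VerificationSuite, VerificationResult

    check = Check(spark, CheckLevel.Error, "Rating data type")

    result = (
        VerificationSuite(spark)
        .onData(df)
        # Numeric covers both Integral and Fractional values.
        .addCheck(check.hasDataType("Rating", ConstrainableDataTypes.Numeric))
        .run()
    )

    # One row per constraint, including its status and message.
    return VerificationResult.checkResultsAsDataFrame(spark, result)
```

Calling `check_rating_is_numeric(spark, df).show(truncate=False)` should then print the constraint status and message for the Rating column.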

lange-labs commented 3 years ago

Closing due to inactivity