awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Change message for isUnique method #153

Open gerileka opened 10 months ago

gerileka commented 10 months ago

For the past day I have been using the Check methods, and I have found that the error message returned by the isUnique method is not clear when it is used on string columns.

import datetime

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

# Spark session with the Deequ jar on the classpath, as in the PyDeequ README
spark = (SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

mock_orders =[
            {
                "date": datetime.date(2019, 12, 28),
                "country_code": "FR",
                "concept_id": "c73bcdcc-2669-4bf6-81d3-e4ae73fb11fd",
                "id": "bar",
                "gtv": 27.0,
            },
            {
                "date": datetime.date(2019, 12, 20),
                "country_code": "UK",
                "concept_id": "123e4567-e89b-12d3-a456-426655440000",
                "id": "bar",
                "gtv": 27.0,
            },
]

orders_reference_mock = spark.createDataFrame(data = mock_orders)

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = (VerificationSuite(spark) 
    .onData(orders_reference_mock) 
    .addCheck(
        check 
        .isUnique("gtv")  
        .isUnique("id") 
    )
    .run())

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)

checkResult_df.collect()

The result is:

[Row(check='Review Check', check_level='Warning', check_status='Warning', constraint='UniquenessConstraint(Uniqueness(List(gtv),None))', constraint_status='Failure', constraint_message='Value: 0.0 does not meet the constraint requirement!'),
 Row(check='Review Check', check_level='Warning', check_status='Warning', constraint='UniquenessConstraint(Uniqueness(List(id),None))', constraint_status='Failure', constraint_message='Value: 0.0 does not meet the constraint requirement!')]

The constraint_message is not clear: it does not say which value violated the constraint or why. It happens in both cases, whether the column is a string or a number.

Is it possible to have a clearer message, please? I am filing this as a feature request rather than a bug.

chenliu0831 commented 10 months ago

Yeah, for historical reasons the value in constraint_message is actually the ratio computed for the constraint (here the uniqueness ratio), not an offending value. It comes from Deequ and suffers from this bug: https://github.com/awslabs/deequ/issues/245.
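To illustrate why the reported value is 0.0 for both columns, here is a minimal pure-Python sketch of a uniqueness ratio, assuming it is defined as the fraction of rows whose value occurs exactly once (my reading of the Deequ documentation for the Uniqueness metric). The function name `uniqueness_ratio` is hypothetical and is not part of PyDeequ.

```python
from collections import Counter
from typing import Hashable, Iterable


def uniqueness_ratio(values: Iterable[Hashable]) -> float:
    """Fraction of rows whose value occurs exactly once in the column."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # Undefined for an empty column; returning NaN is a choice made for this sketch.
        return float("nan")
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / total


# Columns from the mock data in this issue
print(uniqueness_ratio([27.0, 27.0]))    # gtv: both rows share one value -> 0.0
print(uniqueness_ratio(["bar", "bar"]))  # id: both rows share one value -> 0.0
print(uniqueness_ratio(["FR", "UK"]))    # country_code: every value occurs once -> 1.0
```

Under that assumption, isUnique passes only when the ratio is 1.0, so "Value: 0.0" means that no row in the column had a value occurring exactly once.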

chenliu0831 commented 10 months ago

> Is it possible to have a more clear message please? I am putting this as a feature instead of a bug

One way would be to accept a hook/function that lets users customize the error message template. In that case it is indeed a feature request. It seems possible to do this on the Python side, though.
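A rough sketch of what such a Python-side hook could look like, applied as a post-processing step to the rows of the results DataFrame. This is not an existing PyDeequ API: `describe_failure` and the hook signature are hypothetical, and the regular expressions assume the constraint and message strings have exactly the shape shown in the output earlier in this issue.

```python
import re
from typing import Callable, Dict, List, Optional

# Patterns assumed from the output in this issue; other constraint types may look different.
_CONSTRAINT_RE = re.compile(r"^(?P<metric>\w+)Constraint\(\w+\(List\((?P<columns>[^)]*)\)")
_VALUE_RE = re.compile(
    r"Value: (?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?) "
    r"does not meet the constraint requirement!"
)

# Hypothetical hook type: (metric name, column names, metric value) -> message
MessageHook = Callable[[str, List[str], float], str]


def describe_failure(row: Dict[str, str], hook: Optional[MessageHook] = None) -> str:
    """Return a clearer message for a failed constraint row.

    Falls back to the original constraint_message whenever the row is not a
    failure or does not match the assumed formats.
    """
    message = row.get("constraint_message") or ""
    if row.get("constraint_status") != "Failure":
        return message
    constraint_match = _CONSTRAINT_RE.match(row.get("constraint") or "")
    value_match = _VALUE_RE.search(message)
    if constraint_match is None or value_match is None:
        return message
    metric = constraint_match.group("metric")
    columns = [c.strip() for c in constraint_match.group("columns").split(",")]
    value = float(value_match.group("value"))
    if hook is not None:
        return hook(metric, columns, value)
    return f"{metric} of column(s) {', '.join(columns)} is {value}, which fails the check's assertion."


# One of the rows from this issue, written as a plain dict
row = {
    "constraint": "UniquenessConstraint(Uniqueness(List(gtv),None))",
    "constraint_status": "Failure",
    "constraint_message": "Value: 0.0 does not meet the constraint requirement!",
}
print(describe_failure(row))
print(describe_failure(row, hook=lambda m, cols, v: f"{cols[0]}: {m.lower()} ratio is {v}"))
```

With the DataFrame from the original report, this could be applied as `[describe_failure(r.asDict()) for r in checkResult_df.collect()]`. Keeping the hook outside the check itself means nothing has to change on the Deequ (Scala) side.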