awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Rule Suggestions #435

Open dariobig opened 2 years ago

dariobig commented 2 years ago

@dariobig Thank you very much for adding this support; I have been planning to implement these rules at work. Quick question: would it be feasible to also add a rule suggestion for the exact record count of a non-nullable column? (Note: please point me to the right API in case this capability already exists.)

Reasoning behind the ask: at the moment we get an approxCount that is way off in some cases. In one particular case of mine, there is a difference of 5M records between the actual and approximate counts for a non-nullable column with completeness of 1.0. This happens for both the total count and the distinct-value count (I am planning to raise a bug for the latter). An exact-count check would help in data migration and replication scenarios (alongside others), where we could automatically determine whether the migration succeeded.

Originally posted by @krishna-chaitanya-meduri in https://github.com/awslabs/deequ/issues/434#issuecomment-1253256091
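
For reference, the suggestion flow discussed here is Deequ's ConstraintSuggestionRunner. A minimal sketch of running it, adapted from the pattern in the project README (assuming `df` is a Spark DataFrame already in scope):

```scala
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// Profile df and derive constraint suggestions from the default rule set.
val suggestionResult = ConstraintSuggestionRunner()
  .onData(df)
  .addConstraintRules(Rules.DEFAULT)
  .run()

// Print each suggested constraint along with the code to instantiate it.
suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach { suggestion =>
    println(s"Constraint suggestion for '$column': ${suggestion.description}")
    println(s"Code: ${suggestion.codeForConstraint}")
  }
}
```

The request in this issue is for the suggested size/count constraints in this output to be based on exact rather than approximate counts.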

dariobig commented 2 years ago

@krishna-chaitanya-meduri I am trying to understand your use case. Would the hasSize constraint, as in `Check(CheckLevel.Error, "unit testing my data").hasSize(_ == XXX)`, work for you? Or is that constraint not working for you?
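
For readers following along, here is a minimal, self-contained sketch of that hasSize check run through a verification suite (the `expectedCount` value is hypothetical, and `df` is assumed to be a Spark DataFrame in scope):

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

// Hypothetical expected count, for illustration only.
val expectedCount = 5000000L

val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "unit testing my data")
      .hasSize(_ == expectedCount))  // exact row-count assertion
  .run()

// The overall status is Success only if every check passed.
if (verificationResult.status == CheckStatus.Success) {
  println("Exact size check passed")
} else {
  println("Exact size check failed")
}
```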

krishna-chaitanya-meduri commented 2 years ago

@dariobig Yes, this constraint works just fine. My request is to make it part of the rule suggestion flow, which does not cover it at the moment (unless there is a gap in my understanding of the feature). As of now, constraints are suggested from approximate values, which is fine for constraints like completeness, distinct values, etc.

However, in data migration and replication scenarios we would like to run an automated column profiler against the source data before migration or replication, and then validate the target data afterwards. In such cases an exact value (be it the number of records, or the number of non-null values in a domain-critical column; suppose one partition of a Parquet file was skipped during replication for some reason) would let us determine whether the effort succeeded. We can do this manually, but an automated way would spare developers from maintaining the extra code. Happy to discuss further in case this viewpoint does not justify the ask. Please let me know your thoughts. Thank you.
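
As an aside, one way to automate this migration check today, without waiting for rule-suggestion support, is to compute the exact row count on the source with the Size analyzer and assert it on the target. A rough sketch, assuming `spark`, `sourceDf`, and `targetDf` are in scope (the metrics-DataFrame access follows the pattern shown in the Deequ README):

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.checks.{Check, CheckLevel}

// 1) Compute the exact row count of the source (Size is exact, not approximate).
val sourceAnalysis = AnalysisRunner
  .onData(sourceDf)
  .addAnalyzer(Size())
  .run()

val sourceCount: Long = AnalyzerContext
  .successMetricsAsDataFrame(spark, sourceAnalysis)
  .filter("name = 'Size'")
  .first()
  .getAs[Double]("value")
  .toLong

// 2) After migration/replication, verify the target has exactly that many rows.
val result = VerificationSuite()
  .onData(targetDf)
  .addCheck(
    Check(CheckLevel.Error, "migration validation")
      .hasSize(_ == sourceCount))
  .run()
```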