The current `Invalid Count` report metric is confusing. ValidNumerics and ValidStrings rules use `collect_set`, while Bounds rules use aggregations and report back a 1 as well. These design decisions were originally made for performance on large datasets.
Add a `detailLevel` attribute to the `RuleSet().validate` function so the user can specify the report detail level. Higher levels mean longer run times but more detail, which is a good fit for development stages.
https://github.com/databrickslabs/dataframe-rules-engine/blob/72da2c71b4b3a26a57c9ff3199650a2e02923730/src/main/scala/com/databricks/labs/validation/RuleSet.scala#L132-L137
https://github.com/databrickslabs/dataframe-rules-engine/blob/72da2c71b4b3a26a57c9ff3199650a2e02923730/src/main/scala/com/databricks/labs/validation/Validator.scala#L148-L149
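
To make the confusion concrete, here is a rough plain-Python sketch of three ways an invalid count could be computed. It is not the library's code. It assumes that the set-based rules end up counting distinct invalid values and that the Bounds rules collapse to a 0/1 flag, which this issue does not spell out. The function names, the sample data, and the inclusive bounds are all hypothetical.

```python
from typing import Callable, Iterable, Sequence


def invalid_count_distinct(values: Iterable, allowed: Iterable) -> int:
    """Set-based count (assumed analogue of a collect_set approach):
    the number of DISTINCT values that are not in the allowed list."""
    allowed_set = set(allowed)
    return len({v for v in values if v not in allowed_set})


def invalid_flag_bounds(values: Sequence[float], lower: float, upper: float) -> int:
    """Aggregate-based check (assumed analogue of the Bounds rules):
    1 if the min or max falls outside [lower, upper], otherwise 0."""
    return int(min(values) < lower or max(values) > upper)


def invalid_count_rows(values: Iterable, is_valid: Callable[[object], bool]) -> int:
    """Row-level count: what a higher, slower detail level might report."""
    return sum(1 for v in values if not is_valid(v))


# Hypothetical sample data
stores = [1001, 1001, 9999, 9999, 9999, 1002, 8888]
allowed_stores = [1001, 1002]
prices = [2.5, 3.0, 120.0, 450.0, 4.0]

distinct_invalid_stores = invalid_count_distinct(stores, allowed_stores)
row_invalid_stores = invalid_count_rows(stores, lambda v: v in allowed_stores)
bounds_flag_prices = invalid_flag_bounds(prices, 0.0, 100.0)
row_invalid_prices = invalid_count_rows(prices, lambda p: 0.0 <= p <= 100.0)

print(distinct_invalid_stores, row_invalid_stores)
print(bounds_flag_prices, row_invalid_prices)
```

Under these assumptions, the same column yields a different "invalid count" depending on the strategy, which is the ambiguity a `detailLevel` setting would let the user resolve explicitly.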