databrickslabs / dataframe-rules-engine

Extensible Rules Engine for custom Dataframe / Dataset validation

Invalid Count - Add verbosity options and clarification #14

Closed (GeekSheikh closed 3 years ago)

GeekSheikh commented 3 years ago

The current Invalid Count metric in the report is confusing. ValidNumerics and ValidStrings rules use "collect_set", while Bounds rules use aggregates, and both kinds report back a 1. These design choices were originally made for performance on large datasets.

Proposal: use the detailLevel attribute of the RuleSet().validate function to let the user specify the report detail level. Higher levels mean longer run times but more detail, which is useful during development.

https://github.com/databrickslabs/dataframe-rules-engine/blob/72da2c71b4b3a26a57c9ff3199650a2e02923730/src/main/scala/com/databricks/labs/validation/RuleSet.scala#L132-L137

https://github.com/databrickslabs/dataframe-rules-engine/blob/72da2c71b4b3a26a57c9ff3199650a2e02923730/src/main/scala/com/databricks/labs/validation/Validator.scala#L148-L149
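
To make the trade-off concrete, here is a minimal sketch in plain Scala, with a `Seq` standing in for a DataFrame column. It is not the library's code: the object, the function names, and the mapping of detail level 1 to a flag-style count and level 2 to a per-row count are all assumptions made for illustration. The issue does not define what each level reports.

```scala
// Hypothetical sketch only: none of these names exist in dataframe-rules-engine.
object InvalidCountSketch {

  // Assumed "low detail" behaviour: a single min/max aggregate over the column.
  // It can only say whether the bounds were violated, so it reports 1 or 0.
  def coarseInvalidCount(values: Seq[Double], lower: Double, upper: Double): Long =
    if (values.isEmpty) 0L
    else if (values.min < lower || values.max > upper) 1L
    else 0L

  // Assumed "high detail" behaviour: inspect every row and count each failure.
  // More work on large data, but the count reflects the real number of bad rows.
  def detailedInvalidCount(values: Seq[Double], lower: Double, upper: Double): Long =
    values.count(v => v < lower || v > upper).toLong

  // detailLevel here is an illustrative stand-in for the validate attribute.
  def invalidCount(values: Seq[Double], lower: Double, upper: Double, detailLevel: Int): Long =
    detailLevel match {
      case 1 => coarseInvalidCount(values, lower, upper)
      case _ => detailedInvalidCount(values, lower, upper)
    }

  def main(args: Array[String]): Unit = {
    val retailPrice = Seq(1.0, 5.0, 12.0, 15.0) // two values exceed the upper bound
    println(invalidCount(retailPrice, 0.0, 10.0, detailLevel = 1)) // prints 1
    println(invalidCount(retailPrice, 0.0, 10.0, detailLevel = 2)) // prints 2
  }
}
```

With the aggregate-style check, a column with two out-of-bounds values and a column with two million both report 1, which is the source of the confusion described above. The per-row count removes that ambiguity at the cost of a more expensive pass.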