Nike-Inc / spark-expectations

A Python library to support running data quality rules while the Spark job is running ⚡
https://engineering.nike.com/spark-expectations
Apache License 2.0

[FEATURE] Enhance QueryDQ feature to capture the source and target values #55

Closed vigneshwarrvenkat closed 1 month ago

vigneshwarrvenkat commented 7 months ago

Is your feature request related to a problem? Please describe. The query DQ feature only provides output as boolean values. A value of FALSE tells us that the query validation failed, but users won't know what went wrong; they have to rerun the query manually to figure out the difference. If a production support team is handling the failure, it is highly unlikely they are aware of the validation scripts, so it becomes tough to get actionable insights out of the query DQ feature. If the query results of both the source and the target were fetched and stored in a custom stats table, users could build actionable insights or work items from those results.

Describe the solution you'd like Right now, QueryDQ is programmed to accept a single query. Instead, we could pass three queries, as below.

`select X from table1; select Y from table2; select x=y from t1 join t2`. Queries are separated by semicolons. A single query keeps the default behaviour; with three queries, the behaviour is as described below.

X and Y are the values to be compared between source and target respectively, and the third query is the validation query. If the validation returns FALSE, we can fetch the X and Y values and store them as JSON in a custom stats table. The custom table is user managed and should be passed as an argument, as below.

SparkExpectations(custom_dq_info_table="...")
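
A minimal sketch of the proposed flow, assuming an active `SparkSession`. The helper name `run_query_dq` and the result-dictionary layout are illustrative assumptions, not the library's actual API:

```python
import json

from pyspark.sql import SparkSession

def run_query_dq(spark: SparkSession, delimited_query: str) -> dict:
    """Split a semicolon-delimited query_dq expectation and evaluate it."""
    queries = [q.strip() for q in delimited_query.split(";") if q.strip()]
    if len(queries) == 1:
        # Default behaviour: a single validation query returning a boolean.
        return {"status": bool(spark.sql(queries[0]).collect()[0][0])}
    source_q, target_q, validation_q = queries  # assumes exactly three queries
    status = bool(spark.sql(validation_q).collect()[0][0])
    result = {"status": status}
    if not status:
        # On failure, capture the source (X) and target (Y) values as JSON
        # so they can be appended to the user-managed custom stats table.
        result["source_output"] = json.dumps(
            [row.asDict() for row in spark.sql(source_q).collect()]
        )
        result["target_output"] = json.dumps(
            [row.asDict() for row in spark.sql(target_q).collect()]
        )
    return result
```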

Describe alternatives you've considered We are currently implementing the above option as a separate module and using it alongside the other features of SparkExpectations.

Additional context The custom table is user managed; permissions and related concerns have to be handled by the user. The number of records stored in the custom stats table could initially be restricted to 200.
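
Purely as a sketch of the cap described above, one way the captured values could be appended to the user-managed table; the `limit` placement and table handling are assumptions, not the library's behaviour:

```python
from pyspark.sql import DataFrame

MAX_CAPTURED_ROWS = 200  # proposed initial cap on stored records

def persist_captured_values(df: DataFrame, custom_dq_info_table: str) -> None:
    # The user-managed table must already exist with the right permissions;
    # limit() enforces the proposed 200-row cap before the append.
    df.limit(MAX_CAPTURED_ROWS).write.mode("append").saveAsTable(custom_dq_info_table)
```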

vigneshwarrvenkat commented 4 months ago

Connected with @asingamaneni and @jskrajareddy21 on this enhancement. Below are the bug fixes and feature requests coupled with this enhancement request:

  1. The query_dq should execute as is with multiple delimited queries.
  2. Since there is an ask to send the contents of the detailed stats table to Kafka, any data stored in the detailed stats table has to be masked before it is sent to Kafka (see the masking sketch after this list).
  3. There should not be any limitation on the number of delimited query_dq queries.
  4. Handle edge cases where one of the delimited query_dq queries can return an int or a float.
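
A minimal sketch of the masking mentioned in item 2, assuming the captured values live in JSON fields named `source_output` and `target_output` (assumed names, not the library's actual schema):

```python
import json

SENSITIVE_FIELDS = {"source_output", "target_output"}  # assumed field names

def mask_stats_payload(payload: str) -> str:
    """Mask captured values in a detailed-stats record before publishing to Kafka."""
    record = json.loads(payload)
    for field in SENSITIVE_FIELDS:
        if field in record:
            record[field] = "*** MASKED ***"
    return json.dumps(record)
```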

We have started working on this.