GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
395 stars 112 forks source link

data quality check against only one (real) database connector? #1209

Open scorpionjacky opened 1 month ago

scorpionjacky commented 1 month ago

Is it possible to use only one db_connector to do exists test, or check uniqueness of a column. I noticed data-validation requires both -sc and -tc as parameters so that it always run a query on both of them.

Is it poosible to run query in one db connector and compare the result to a constant value (fake 2nd db connector but w/o actually doing connection and running query). This can be used to check existence of some counts to some known threshold value (like 0).

Basically this extends the feature to data quality check against one database but using a fake/dummy 2nd connector so that the API/CLI is still consistent?

nj1973 commented 1 month ago

That's not functionality we have considered previously and we don't have a specific technique for it.

I had a think about how we could subvert existing functionality and something like this might do the trick (at least functionally - but it is a hack):

Compare my Oracle count against an expected count of 55501:

data-validation validate custom-query row -sc=ora -tc=ora \
-sq="SELECT 1 id, COUNT(*) c FROM sh.customers" \
-tq="SELECT 1 id, 55501 c FROM dual" \
--primary-keys=id -comp-fields=c
...
╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤════════════════════╤════════════════════╤══════════════════╤══════════════════
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │   source_agg_value │   target_agg_value │ pct_difference   │ validation_status
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪════════════════════╪════════════════════╪══════════════════╪══════════════════
│ c                 │ Custom-query      │                     │ c                    │              55500 │              55501 │                  │ fail
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧════════════════════╧════════════════════╧══════════════════╧══════════════════
scorpionjacky commented 3 weeks ago

That's the same approach I would have to do using the existing design. If you have only a bigquery DWH w/o other database server types, the query just to get a constant would have to invoke a BQ job which take longer time than running against a relational database, like PostgreSQL (much less to say just a constant w/o a database connection), due to the asynchronous job approach of BQ. Overall with many such cases, it would take more time to run all tests than assumed in terms of efficiency.

Support of constant values with aliases (for reference by names) would be the most efficient if available.

Thank you for your reply with the example @nj1973, much appreciated.