GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match

Error on Snowflake binary primary key random row validations #1146

Closed: nj1973 closed this issue 5 months ago

nj1973 commented 5 months ago

When validating tables using random rows, we build a list of primary key literal values and then inject that list into the validation queries as an IN list, for example:

FROM `db-01.dvt_test.dvt_num_pk` t6
WHERE t6.`ID` IN (1,20,100,500,123456)

For tables with binary primary keys we cannot inject binary literals into a SQL statement, so we convert the binary values to hex first and cast them back to binary in SQL, for example:

FROM `db-01.dvt_test.dvt_bin_pk` t7
WHERE t7.`BINARY_ID` IN (FROM_HEX('4456542d6b65792d34'), FROM_HEX('4456542d6b65792d313132'), 
  FROM_HEX('4456542d6b65792d3236'), FROM_HEX('4456542d6b65792d3130'), FROM_HEX('4456542d6b65792d3531'))

We have an issue on Snowflake for binary primary keys when the random row sample size is > 50:

SQL compilation error: In-list contains more than 50 non-constant values

We need a workaround for Snowflake engines.

cc: @henrihem

nj1973 commented 5 months ago

Hi @henrihem, the fix for this issue should break up the problematic Snowflake IN clauses into a series of IN/OR combinations, as sketched below. I was unable to find a way to use array manipulation like you suggested via the framework we use. Plus, this solution is applicable to other SQL engines should we discover a similar limitation elsewhere. Let me know if this fix does not resolve your issue.
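For illustration, here is a minimal Python sketch of the chunking idea, assuming an Ibis-style column expression with the usual isin method; the names chunked_in_predicate and SNOWFLAKE_IN_LIST_LIMIT are hypothetical, not DVT's actual code:

import functools
import operator

# Hypothetical sketch, not the actual DVT fix: Snowflake rejects IN lists
# with more than 50 non-constant values, so split the values into chunks
# of at most 50 and OR the resulting IN predicates together.
SNOWFLAKE_IN_LIST_LIMIT = 50

def chunked_in_predicate(column, values, chunk_size=SNOWFLAKE_IN_LIST_LIMIT):
    """Build `col IN (...) OR col IN (...) OR ...` from one long value list."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    predicates = [column.isin(chunk) for chunk in chunks]
    return functools.reduce(operator.or_, predicates)

Applied to the binary key example above, this would compile to something like WHERE BINARY_ID IN (FROM_HEX(...), ...) OR BINARY_ID IN (FROM_HEX(...), ...), with each IN list staying at or under the 50-item limit.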

henrihem commented 5 months ago

Hi @nj1973

There are a few bugs in this fix. I'll report those below.

Once I was able to bypass those bugs, the fix for the described issue works as intended. I was able to run random row validations with 1000 rows, so the fix itself seems to work.

Issue 1

venv/lib/python3.10/site-packages/data_validation/clients.py", line 27, in <module>
    from third_party.ibis.ibis_cloud_spanner.api import spanner_connect
ModuleNotFoundError: No module named 'third_party'

When clean-installing from the develop branch into a venv, third_party is missing completely. I was able to add third_party manually, and after that it works.

Issue 2

Not sure if this is a bug or just an intended limitation for random row validations, but with this fix the maximum number of random rows appears to lie somewhere between 5000 and 10,000. With 5000 random rows it works just fine, but with 10,000 rows it throws:

RecursionError: maximum recursion depth exceeded while calling a Python object

If there is a maximum number of random rows, it should be documented.

helensilva14 commented 5 months ago

Hi @henrihem, thanks a lot for your detailed comment and continued collaboration with us!

Issue 1 was a side effect of a configuration file switch that we attempted; we have already reverted it in #1156. You should not face it anymore. We very much appreciate you letting us know.

And about Issue 2, we just merged a bug fix at #1158 from @nj1973. If you could please take a look and test it, that would be great, thank you!

nj1973 commented 5 months ago

Sorry, to be clear, my bug fix does not solve Issue 2. I just noticed a problem with my original fix while investigating it.

We are hitting a Python recursion limit due to the way Ibis builds OR clauses, and the high number of them caused by Snowflake's 50-item IN list limit when the items are expressions. We won't have this problem for regular IN lists, only for those on binary columns. I think we might have to accept this as a limitation and think about how we document it.
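As a rough standalone illustration of that limit (plain Python, not Ibis internals): chaining many predicates with | produces a deeply left-nested tree of binary OR nodes, and any recursive traversal of that tree, such as a compiler visitor, consumes stack frames in proportion to its depth:

import sys

# Plain-Python illustration, not Ibis internals: a left-nested chain of
# binary OR nodes like the one produced by chaining predicates with `|`.
class Or:
    def __init__(self, left, right):
        self.left = left
        self.right = right

def depth(node):
    # Recursive walk, one stack frame per nesting level; a real compiler
    # visitor may use several frames per node, so even a few hundred
    # chunked IN clauses can overflow the default limit.
    return 1 + depth(node.left) if isinstance(node, Or) else 0

expr = "pred"
for _ in range(10_000):  # one OR node per extra IN-list chunk
    expr = Or(expr, "pred")

print(sys.getrecursionlimit())  # typically 1000
depth(expr)                     # raises RecursionError long before 10,000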