Hi @henrihem, the fix for this issue breaks up the problematic Snowflake IN clauses into a series of IN/OR combinations. I was unable to find a way to use array manipulation as you suggested via the framework we use. Plus, this solution is applicable to other SQL engines should we discover similar limitations elsewhere. Let me know if this fix does not resolve your issue.
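For illustration, here is a minimal sketch of the IN/OR idea against an Ibis column expression (`in_list_predicate` is a hypothetical helper name for this sketch, not the function used in the fix):

```python
import functools
import operator

SNOWFLAKE_EXPR_IN_LIMIT = 50  # Snowflake's limit for non-constant IN list items


def in_list_predicate(column, values, chunk_size=SNOWFLAKE_EXPR_IN_LIMIT):
    """Build `column IN (...) OR column IN (...) OR ...` from chunks of values."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return functools.reduce(operator.or_, (column.isin(chunk) for chunk in chunks))
```

Each chunk stays under the 50-item limit, and the chunks are recombined with OR, so the overall predicate selects the same rows as a single large IN list.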
Hi @nj1973
There are a few bugs in this fix. I'll report those below.
Once I was able to bypass those bugs, the fix for the described issue works as intended. I was able to run random row validations with 1,000 rows, so the fix itself seems to work.

Issue 1:
venv/lib/python3.10/site-packages/data_validation/clients.py", line 27, in <module>
from third_party.ibis.ibis_cloud_spanner.api import spanner_connect
ModuleNotFoundError: No module named 'third_party'
When clean installing from the develop branch into a venv, the third_party package is missing completely. I was able to add third_party manually and after that it works.
Issue 2: Not sure if this is a bug or just an intended limitation for random row validations, but with this fix the maximum number of random rows looks to lie somewhere between 5,000 and 10,000. With 5,000 random rows it works just fine, but with 10,000 rows it throws:
RecursionError: maximum recursion depth exceeded while calling a Python object
If there is a maximum number of random rows, it should be documented.
Hi @henrihem, thanks a lot for your detailed comment and continued collaboration with us!
Issue 1 was a side effect of a configuration file switch that we attempted, but we already reverted it in #1156, so you should not face it anymore. We very much appreciate you letting us know.
As for Issue 2, we just merged a bug fix at #1158 from @nj1973. If you could take a look and test it, that would be great, thank you!
Sorry, to be clear, my bug fix does not solve Issue 2. I just noticed a problem with my original fix while investigating it.
We are hitting a Python limit due to the way Ibis builds OR clauses, and the sheer number of them caused by the 50-item IN list limit on Snowflake when the items are expressions. We won't have this problem for regular IN lists, only for ones on binary columns. I think we might have to accept this as a limitation and think about how we document it.
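For intuition, here is a rough, self-contained illustration of the failure mode (not DVT's actual code path; the table and values are made up, and it assumes a recent Ibis version):

```python
import functools
import operator

import ibis

t = ibis.table({"id": "string"}, name="t")

# e.g. a 25,000-row sample split into 500 chunks of 50 values each
chunks = [[f"val-{i}-{j}" for j in range(50)] for i in range(500)]

# functools.reduce builds a left-deep tree of ~500 nested OR nodes
pred = functools.reduce(operator.or_, [t.id.isin(c) for c in chunks])

# Anything that walks this expression tree recursively (such as
# compiling it to SQL) can then blow past Python's default recursion
# limit of 1000 and raise RecursionError.
sql = ibis.to_sql(t.filter(pred))
```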
When validating tables using random rows we build a list of primary key literal values and then inject that list into the validation queries as an IN list, for example:
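(Illustrative query only; table, column, and key values are made up.)

```sql
SELECT id, col_a, col_b
FROM my_table
WHERE id IN (101, 102, 103);
```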
For tables with binary primary keys we cannot inject binary literals into a SQL statement and therefore convert the binary values to hex first and cast them back to binary in SQL, for example:
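(Illustrative Snowflake query; names and values are made up. `TO_BINARY(<hex_string>, 'HEX')` is Snowflake's hex-to-binary cast.)

```sql
SELECT binary_id, col_a, col_b
FROM my_table
WHERE binary_id IN (TO_BINARY('1F8B08', 'HEX'), TO_BINARY('A1B2C3', 'HEX'));
```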
We have an issue on Snowflake for binary primary keys when the random row sample size is > 50, because Snowflake limits IN lists to 50 items when the items are expressions rather than plain literals.
We need a workaround for Snowflake engines.
cc: @henrihem