GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

Column validation on table with many columns raises maximum recursion depth exception #1251

Open nj1973 opened 1 week ago

nj1973 commented 1 week ago

We have previously had issues reporting this error for row validation, but it can also occur with column validation.

Example command using our integration test table:

data-validation validate column --source-conn=bq --target-conn=bq \
      --tables-list="pso_data_validator.dvt_many_cols" \
      --count="*"
...
  File "/some-path/professional-services-data-validator/data_validation/data_validation.py", line 338, in _execute_validation
    result_df = combiner.generate_report(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/some-path/professional-services-data-validator/data_validation/combiner.py", line 77, in generate_report
    differences_df = client.execute(differences_pivot)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/some-path/professional-services-data-validator/.venv/lib/python3.11/site-packages/ibis/backends/pandas/__init__.py", line 307, in execute
    return execute_and_reset(node, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/usr/lib/python3.11/weakref.py", line 415, in __getitem__
    return self.data[ref(key)]
                     ^^^^^^^^
RecursionError: maximum recursion depth exceeded while calling a Python object

This was found during my own testing and was not reported by a customer.

nj1973 commented 1 week ago

It is interesting that the exception is thrown in generate_report() in combiner.py, at the client.execute(differences_pivot) call. That suggests, though I may be mistaken, that the actual Ibis queries against the source and target completed successfully and the failure happens while combining the results. We should look at what we are doing in that step to see if it can be improved.

We could also look into sys.setrecursionlimit(n); perhaps increasing the limit from the default is sensible.
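As a rough sketch of the sys.setrecursionlimit(n) idea, the limit could be raised only around the deep call and restored afterwards. This is not DVT code: recursion_limit and nested_depth are hypothetical names for illustration, and nested_depth is only a stand-in for a deeply recursive call such as the one in the traceback above.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, restoring the old value on exit.

    Hypothetical helper for illustration; not part of DVT or Ibis.
    """
    previous = sys.getrecursionlimit()
    # Only ever raise the limit; never lower it below the current value.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def nested_depth(n):
    """Stand-in for a deeply recursive call (assumption, not the real code path)."""
    return 0 if n == 0 else 1 + nested_depth(n - 1)


base = sys.getrecursionlimit()

# Recursing base + 500 levels would exceed the default limit,
# but it fits once the limit is raised with some headroom.
with recursion_limit(base + 2000):
    depth_reached = nested_depth(base + 500)
```

Scoping the change this way keeps the higher limit away from unrelated code. A raised limit only moves the ceiling, so a wide enough table could still hit it, which is why looking at the combiner logic itself still seems worthwhile.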

We also need to add column validation tests for pso_data_validator.dvt_many_cols.