GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

Getting MemoryError for bigger tables (numpy.core._exceptions._ArrayMemoryError) #1144

Closed: pavannadgoudar closed this issue 1 month ago

pavannadgoudar commented 1 month ago

The table is big, so we tried comparing only 2 columns, but we are still getting a memory error.

Command: data-validation validate row --source-conn MSSQL --target-conn BQ --tables-list = --filter-status fail --primary-keys -comp-fields , --bq-result-handler ..DVT_RESULTS --filters ='2024-01-10'

Error from command: numpy.core._exceptions._ArrayMemoryError: Unable to allocate GiB for an array with shape (,) and data type object

Table volume: 2.5 lakh (~250,000) records and 216 columns

nehanene15 commented 1 month ago

Memory constraints can occur for large row validations, since every row has to be brought into memory to compare source and target values. For large tables, you will likely need the Scaling DVT approach: generate table partitions and then, optionally, distribute those partitioned validations across multiple workers.
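A rough sketch of that partitioned workflow, assuming recent DVT flag names (verify against data-validation generate-table-partitions --help for your version; the table, key, column, and bucket values below are placeholders):

# 1. Split the row validation into N partitions and write one YAML config
#    per partition to a config directory (GCS or local).
data-validation generate-table-partitions \
  --source-conn MSSQL --target-conn BQ \
  --tables-list <schema.table> \
  --primary-keys <pk_col> \
  --comparison-fields <col1>,<col2> \
  --filter-status fail \
  --config-dir gs://<bucket>/dvt_partitions/ \
  --partition-num 50

# 2. Run the generated configs. Each run only pulls its partition's rows into
#    memory, so peak memory is bounded by the partition size rather than the table size.
data-validation configs run --config-dir gs://<bucket>/dvt_partitions/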

You can also increase the machine size, but that won't be a feasible fix for very large tables.

pavannadgoudar commented 1 month ago

Thank you very much @nehanene15, this information is helpful. I think scaling needs additional infrastructure-level changes. We will initiate a discussion to check feasibility on GKE/Cloud Run Jobs.

nehanene15 commented 1 month ago

Sounds good. I will close this issue; feel free to open another one if you run into a bug when using the Cloud Run approach.
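As a rough sketch of that Cloud Run approach (the job name, image, region, task count, and wrapper script below are placeholders, and the partition file naming is illustrative; check what generate-table-partitions actually wrote to your config directory):

# Create a Cloud Run job whose tasks each validate one partition. The container
# image is assumed to have DVT installed plus a small wrapper script.
gcloud run jobs create dvt-row-validation \
  --image=us-docker.pkg.dev/<project>/<repo>/dvt:latest \
  --region=us-central1 \
  --tasks=50 --parallelism=10 \
  --memory=4Gi \
  --command=/app/run_partition.sh

gcloud run jobs execute dvt-row-validation --region=us-central1

# /app/run_partition.sh (hypothetical wrapper): each task selects its partition
# config via the built-in CLOUD_RUN_TASK_INDEX variable and runs it, e.g.
#   data-validation configs run \
#     --config-file "gs://<bucket>/dvt_partitions/<schema.table>/$(printf '%04d' "$CLOUD_RUN_TASK_INDEX").yaml"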