GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
399 stars 114 forks source link

Increase max partiton generate-table-partitions max value for --partition-num #1137

Closed franklinWhaite closed 1 week ago

franklinWhaite commented 4 months ago

Currently generate-table-partitions only allows for --partition-num up to 1000 partitions. Given that the recommended partition size is of ~50k rows each, we have a limitation when working with tables with more than 50,000,000 rows.

At this moment we need to be able to handle tables of up to ~2 billion records (e.g. ~40k partitions)

helensilva14 commented 4 months ago

Related to #966

sundar-mudupalli-work commented 4 months ago

Hi,

Neha is right. We can increase the number of partitions to 10K by changing the following two lines in cli_tools.py:

        choices=range(1, 1001),
        metavar="[1-1000]",

basically change the limits to 10001 and 10000. To increase the number of partitions higher - say to 1M partitions - we would need to change the number of characters we use for the name of the yaml files. Currently we use 4 characters - so can do 0000.yaml to 9999.yaml. For 1M partitions we would need to use 6 characters, so that we can do 000000.yaml to 999999.yaml. I suggest that we pause on this change in DVT - since it is more complex. First, we would need to change the way we generate the file names. So this line in partition_builder.py:

                target_file_name = "0" * (4 - len(str(pos))) + str(pos) + ".yaml"

could be rewritten as:

                target_file_name = f"{pos:06d}.yaml"

We will also need a change in how we process files in directories if DVT is run via cloud run jobs or K8 - see __main__.py. There we currently look for file names from 0000.yaml to 9999.yaml. We would need to see if the directory contains these files (old generate-table-partitions behavior) or from 000000.yaml to 000000.yaml (new generate-table-partitions behavior). That is these lines:

            config_file_path = (
                f"{args.config_dir}{job_index:04d}.yaml"
                if args.config_dir.endswith("/")
                else f"{args.config_dir}/{job_index:04d}.yaml"
            )

We would also need to test this and will likely need to run DVT on a node with at least 16GB memory to partition a table into 1M partitions. generate-table-partitions could run out of memory - since it is handling 1M rows in memory with the partition information.

Sundar Mudupalli

franklinWhaite commented 4 months ago

If I understand correctly, an increase in max partitions from 1 k -> 10 k is viable. Going higher, e.g. 1 M is more complex. In that case what would be the suggested approach to perform row validation on tables with more than 500 M rows if recommended partition size is 50 k rows.

It is common to encounter tables larger than 500 M rows and even tables in the order of billions.

helensilva14 commented 1 week ago

Closing as completed since the feature request itself was implemented on PR #1139