Closed franklinWhaite closed 1 week ago
Related to #966
Hi,
Neha is right. We can increase the number of partitions to 10K by changing the following two lines in cli_tools.py:
choices=range(1, 1001),
metavar="[1-1000]",
Basically, change those limits to 10001 and 10000 respectively.
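As a minimal sketch of that change (the argparse wiring around these two lines is an assumption on my part; only the choices/metavar values come from cli_tools.py), the widened argument definition could look like:

```python
import argparse

# Hypothetical reconstruction of the partition-count argument in cli_tools.py,
# with the upper bound raised from 1000 to 10000.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--partition-num",
    type=int,
    choices=range(1, 10001),   # was: range(1, 1001)
    metavar="[1-10000]",       # was: "[1-1000]"
    help="Number of partitions to split the table into",
)

args = parser.parse_args(["--partition-num", "10000"])
print(args.partition_num)  # 10000
```

With `choices=range(1, 10001)`, argparse itself rejects values outside 1-10000, so no extra validation code is needed.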
To increase the number of partitions higher - say to 1M partitions - we would need to change the number of characters we use for the names of the yaml files. Currently we use 4 characters, so we can go from 0000.yaml to 9999.yaml. For 1M partitions we would need to use 6 characters, so that we can go from 000000.yaml to 999999.yaml. I suggest that we pause on this change in DVT, since it is more complex.
First, we would need to change the way we generate the file names. So this line in partition_builder.py:
target_file_name = "0" * (4 - len(str(pos))) + str(pos) + ".yaml"
could be rewritten as:
target_file_name = f"{pos:06d}.yaml"
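For illustration, the f-string form pads identically to a widened version of the current expression - a quick standalone check, not DVT code:

```python
# Verify that f"{pos:06d}.yaml" matches the manual zero-padding approach
# currently used in partition_builder.py, widened from 4 to 6 characters.
for pos in (0, 7, 123456, 999999):
    manual = "0" * (6 - len(str(pos))) + str(pos) + ".yaml"
    fstring = f"{pos:06d}.yaml"
    assert manual == fstring
    print(fstring)
```

The `06d` format spec zero-pads to a minimum width of 6, so 0 becomes `000000.yaml` and 999999 stays `999999.yaml`.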
We will also need a change in how we process files in directories when DVT is run via Cloud Run jobs or K8s - see __main__.py. There we currently look for file names from 0000.yaml to 9999.yaml. We would need to check whether the directory contains those files (old generate-table-partitions behavior) or files from 000000.yaml to 999999.yaml (new generate-table-partitions behavior). That is, these lines:
config_file_path = (
f"{args.config_dir}{job_index:04d}.yaml"
if args.config_dir.endswith("/")
else f"{args.config_dir}/{job_index:04d}.yaml"
)
We would also need to test this, and would likely need to run DVT on a node with at least 16GB of memory to partition a table into 1M partitions - generate-table-partitions could run out of memory, since it holds the partition information for 1M rows in memory.
Sundar Mudupalli
If I understand correctly, an increase in max partitions from 1K -> 10K is viable, but going higher, e.g. to 1M, is more complex. In that case, what would be the suggested approach to perform row validation on tables with more than 500M rows, if the recommended partition size is 50K rows?
It is common to encounter tables larger than 500M rows, and even tables on the order of billions.
Closing as completed, since the feature request itself was implemented in PR #1139.
Currently generate-table-partitions only allows --partition-num up to 1000 partitions. Given that the recommended partition size is ~50k rows per partition, this is a limitation when working with tables of more than 50,000,000 rows.
At the moment we need to be able to handle tables of up to ~2 billion records (i.e. ~40k partitions).
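The arithmetic behind the ~40k figure (plain Python, just to show the calculation; the 50k partition size is the recommendation quoted above):

```python
import math

rows = 2_000_000_000     # ~2 billion records
partition_size = 50_000  # recommended rows per partition
partitions = math.ceil(rows / partition_size)
print(partitions)  # 40000
```

So a 2-billion-row table at the recommended partition size needs 40x the current 1000-partition cap, but well under the 1M case discussed above.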