Adding support for ReadWithUniformPartition for numeric types and composite keys.
Summary
ReadWithUniformPartition follows nearly the same basic contract as JDBCIO.readWithPartition.
In addition to what JDBCIO.readWithPartition provides, this transform supports:
Near-uniform splitting of the input key space based on range counts. No partition will have a count greater than twice the expected mean.
Use of composite keys for splitting when necessary.
Injection of a type mapper, which makes it easier to support strings in the future.
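One way to read the "no partition greater than twice the expected mean" guarantee is sketched below. This is an illustration only, not the code in this PR: the class UniformSplitSketch, its Range type, and the in-memory count (standing in for a COUNT query against the database) are all hypothetical, and it handles only a single numeric key column with unique values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: count-based splitting of a numeric key space.
final class UniformSplitSketch {

  /** Half-open key range [start, end). */
  static final class Range {
    final long start;
    final long end;

    Range(long start, long end) {
      this.start = start;
      this.end = end;
    }

    /** A range covering a single key value cannot be split on this column. */
    boolean splittable() {
      return end - start > 1;
    }

    long mid() {
      return start + (end - start) / 2;
    }
  }

  /** Index of the first element >= key in a sorted array. */
  static int lowerBound(long[] sorted, long key) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int m = (lo + hi) >>> 1;
      if (sorted[m] < key) {
        lo = m + 1;
      } else {
        hi = m;
      }
    }
    return lo;
  }

  /** Number of keys in the range; stands in for a COUNT query on the table. */
  static long count(long[] sortedKeys, Range r) {
    return lowerBound(sortedKeys, r.end) - lowerBound(sortedKeys, r.start);
  }

  /**
   * Halves any range whose count exceeds twice the expected mean until every
   * range is within the bound or can no longer be split.
   * Assumes max key + 1 does not overflow a long.
   */
  static List<Range> split(long[] sortedKeys, int targetPartitions) {
    List<Range> done = new ArrayList<>();
    if (sortedKeys.length == 0) {
      return done;
    }
    double mean = (double) sortedKeys.length / targetPartitions;
    Deque<Range> pending = new ArrayDeque<>();
    pending.push(new Range(sortedKeys[0], sortedKeys[sortedKeys.length - 1] + 1));
    while (!pending.isEmpty()) {
      Range r = pending.pop();
      if (count(sortedKeys, r) > 2 * mean && r.splittable()) {
        // Push the upper half first so the lower half is processed first
        // and the output stays in ascending key order.
        pending.push(new Range(r.mid(), r.end));
        pending.push(new Range(r.start, r.mid()));
      } else {
        done.add(r);
      }
    }
    return done;
  }
}
```

In this sketch a range that holds many rows for a single key value stays oversized because it cannot be halved further; that is the situation where splitting on an additional column of a composite key becomes necessary.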
Overview of commits.
This change consists mainly of the following parts (in separate commits):
Basic Range and boundary classes. This part implements basic classes to represent a splittable boundary and range. An unsplittable range can have child ranges as columns get added to the splitting process.
DBAdapter and statement preparator implementation to get the count and boundary (min, max) of a range.
Transforms to iteratively split the ranges until a near-uniform split is achieved.
Integration with the larger reader under a feature flag.
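As a rough illustration of the count and boundary statements mentioned above, the sketch below builds parameterized SQL for one split column, optionally nested under parent columns that are pinned to a single value (one possible reading of how child ranges work for composite keys). The class RangeQuerySketch, its method names, and the table and column names in the usage note are hypothetical and are not the DBAdapter API from this PR. Identifier quoting and the inclusive handling of the maximum boundary are left out.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: SQL text for the count and boundary of a range.
final class RangeQuerySketch {

  /** WHERE fragment that pins each parent column to one bound value. */
  static String parentFilter(List<String> parentColumns) {
    return parentColumns.stream()
        .map(c -> c + " = ?")
        .collect(Collectors.joining(" AND "));
  }

  /** Counts rows whose split column falls in the half-open range [?, ?). */
  static String countQuery(String table, List<String> parentColumns, String splitColumn) {
    String range = splitColumn + " >= ? AND " + splitColumn + " < ?";
    String where =
        parentColumns.isEmpty() ? range : parentFilter(parentColumns) + " AND " + range;
    return "SELECT COUNT(*) FROM " + table + " WHERE " + where;
  }

  /** Finds the (min, max) boundary of the split column within the parent values. */
  static String boundaryQuery(String table, List<String> parentColumns, String splitColumn) {
    String select =
        "SELECT MIN(" + splitColumn + "), MAX(" + splitColumn + ") FROM " + table;
    return parentColumns.isEmpty()
        ? select
        : select + " WHERE " + parentFilter(parentColumns);
  }
}
```

For example, boundaryQuery("orders", List.of("customer_id"), "order_id") yields a statement that finds the order_id boundary inside a single customer_id value, which is where a child range on the second key column would start.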
Feature Flag.
Currently there is a feature flag in JdbcIOWrapperConfig named readWithUniformPartitionsFeatureEnabled, which controls whether the new partitioning logic runs in the migration.
As of now, the flag defaults to enabled.
It is not exposed as a pipeline option (which unfortunately means toggling it requires a rebuild) so that options don't get added and later reverted.
Performance
The splitting takes about 2 to 3 minutes per table (for a 1 TB table).
If the job reads multiple tables in parallel, please consider adding DATAFLOW_SERVICE_OPTIONS="min_num_workers=" to the Dataflow job, as Dataflow tends to scale down quickly.
Note - unless we have the entire flow from the basic range class to integration, it's hard to test this on a real migration.