GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

Adding support of uniform partitioning for numeric types and composite keys. #1657

Closed VardhanThigle closed 4 weeks ago

VardhanThigle commented 1 month ago

Adding support for ReadWithUniformPartition for numeric types and composite keys.

Summary

ReadWithUniformPartition follows nearly the same basic contract as JdbcIO.readWithPartitions.

In addition to what JdbcIO.readWithPartitions provides, this transform:

  1. Splits the input key space near-uniformly based on range counts. No partition will have a count greater than twice the expected mean.
  2. Uses composite keys for splitting when necessary.
  3. Allows injection of a type mapper, making it easier to support strings in the future.
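To make the "no more than twice the expected mean" goal concrete, here is a minimal sketch of count-based splitting. It is not the code from this PR: the class `UniformSplitSketch`, its `Range` type, and the in-memory sorted array that stands in for a COUNT(*) query are all hypothetical, and it assumes a single numeric key column with unique values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative only: names and structure are hypothetical, not the PR's classes. */
class UniformSplitSketch {

  /** Half-open key range [start, end) together with the number of rows it holds. */
  static final class Range {
    final long start;
    final long end;
    final long count;

    Range(long start, long end, long count) {
      this.start = start;
      this.end = end;
      this.count = count;
    }

    /** A range covering a single key value cannot be halved any further. */
    boolean isSplittable() {
      return end - start > 1;
    }
  }

  /** Index of the first element >= key (binary search over a sorted array). */
  static int lowerBound(long[] sortedKeys, long key) {
    int lo = 0;
    int hi = sortedKeys.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedKeys[mid] < key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Stand-in for a COUNT(*) query restricted to [start, end). */
  static long count(long[] sortedKeys, long start, long end) {
    return lowerBound(sortedKeys, end) - lowerBound(sortedKeys, start);
  }

  /** Halves every range whose count exceeds twice the expected mean. */
  static List<Range> split(long[] sortedKeys, int targetPartitions) {
    if (sortedKeys.length == 0 || targetPartitions <= 0) {
      throw new IllegalArgumentException("need at least one key and one partition");
    }
    // Twice the expected mean rows per partition (integer arithmetic).
    long limit = Math.max(1L, 2L * sortedKeys.length / targetPartitions);
    long min = sortedKeys[0];
    // Assumes the largest key is below Long.MAX_VALUE and the key span fits in a long.
    long maxExclusive = sortedKeys[sortedKeys.length - 1] + 1;

    Deque<Range> pending = new ArrayDeque<>();
    pending.push(new Range(min, maxExclusive, sortedKeys.length));
    List<Range> done = new ArrayList<>();
    while (!pending.isEmpty()) {
      Range r = pending.pop();
      if (r.count <= limit || !r.isSplittable()) {
        if (r.count > 0) {
          done.add(r); // empty ranges carry no rows and are dropped
        }
        continue;
      }
      long mid = r.start + (r.end - r.start) / 2;
      // Push the right half first so the left half is processed first (ascending output).
      pending.push(new Range(mid, r.end, count(sortedKeys, mid, r.end)));
      pending.push(new Range(r.start, mid, count(sortedKeys, r.start, mid)));
    }
    return done;
  }

  public static void main(String[] args) {
    long[] keys = {1, 2, 3, 4, 5, 6, 7, 8, 1000, 2000};
    for (Range r : split(keys, 5)) {
      System.out.println("[" + r.start + ", " + r.end + ") count=" + r.count);
    }
  }
}
```

With duplicate key values, a single-value range can still exceed the limit in this sketch; that is the situation in which splitting on an additional column of a composite key becomes necessary.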

Overview of commits

This change consists mainly of the following parts (in separate commits):

  1. Basic range and boundary classes. This part implements the basic classes that represent a splittable boundary and range. An unsplittable range can have child ranges as columns get added to the splitting process.
  2. DBAdapter and statement preparator implementation to get the count and boundary (min, max) of a range.
  3. Transforms that iteratively split the ranges until a near-uniform split is achieved.
  4. Integration with the larger reader under a feature flag.
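As a rough illustration of the count and boundary lookups, the sketch below builds parameterized SQL for a range on one split column, with equality predicates on the columns already fixed by a parent range. This is an assumption about the general shape of such queries, not the statements generated by this PR; `RangeQuerySketch`, its method names, and the half-open bound convention are hypothetical.

```java
import java.util.List;
import java.util.StringJoiner;

/** Illustrative only: hypothetical query builders, not the PR's DBAdapter. */
class RangeQuerySketch {

  /** Equality predicates for the columns already fixed by a parent range. */
  private static StringJoiner fixedPredicates(List<String> fixedColumns) {
    StringJoiner where = new StringJoiner(" AND ");
    for (String column : fixedColumns) {
      where.add(column + " = ?");
    }
    return where;
  }

  /** Counts rows whose split column falls in the half-open range [?, ?). */
  static String countQuery(String table, List<String> fixedColumns, String splitColumn) {
    StringJoiner where = fixedPredicates(fixedColumns);
    where.add(splitColumn + " >= ?");
    where.add(splitColumn + " < ?");
    return "SELECT COUNT(*) FROM " + table + " WHERE " + where;
  }

  /** Finds the (min, max) boundary of the split column under the fixed parent columns. */
  static String boundaryQuery(String table, List<String> fixedColumns, String splitColumn) {
    String select =
        "SELECT MIN(" + splitColumn + "), MAX(" + splitColumn + ") FROM " + table;
    if (fixedColumns.isEmpty()) {
      return select;
    }
    return select + " WHERE " + fixedPredicates(fixedColumns);
  }

  public static void main(String[] args) {
    System.out.println(countQuery("orders", List.of("region"), "id"));
    System.out.println(boundaryQuery("orders", List.of("region"), "id"));
  }
}
```

Table and column names are concatenated unquoted here, so they are assumed to come from trusted schema metadata; only the values are bound through `?` placeholders.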

Feature Flag

Currently, a feature flag in JdbcIOWrapperConfig named readWithUniformPartitionsFeatureEnabled controls whether the new partitioning logic runs in the migration.

  1. As of now, the flag is enabled by default.
  2. It is not exposed as a pipeline option (which unfortunately means toggling it requires a rebuild), so that options don't get added and then reverted.

Performance

  1. The splitting takes ~2 to 3 minutes per table (for a 1 TB table).
  2. If the job is running on multiple tables in parallel, please consider adding DATAFLOW_SERVICE_OPTIONS="min_num_workers=" to the Dataflow job, as Dataflow tends to scale down quickly.

Note: unless we have the entire flow, from the basic range classes to integration, it's hard to test this on a real migration.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 96.24531% with 30 lines in your changes missing coverage. Please review.

Project coverage is 48.27%. Comparing base (c114330) to head (23316ca).

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main    #1657      +/-   ##
============================================
+ Coverage     41.27%   48.27%   +6.99%
+ Complexity     2940      984    -1956
============================================
  Files           771      326     -445
  Lines         45127    17453   -27674
  Branches       4819     1737    -3082
============================================
- Hits          18626     8425   -10201
+ Misses        24935     8453   -16482
+ Partials       1566      575     -991
```

| Components | Coverage Δ | |
|---|---|---|
| spanner-templates | `62.90% <96.24%> (+1.62%)` | :arrow_up: |
| spanner-import-export | `∅ <ø> (∅)` | |
| spanner-live-forward-migration | `74.14% <ø> (ø)` | |
| spanner-live-reverse-replication | `50.56% <ø> (ø)` | |
| spanner-bulk-migration | `83.45% <96.24%> (+2.81%)` | :arrow_up: |

| Files | Coverage Δ | |
|---|---|---|
| ...jdbc/dialectadapter/mysql/MysqlDialectAdapter.java | `100.00% <100.00%> (ø)` | |
| .../io/jdbc/iowrapper/config/JdbcIOWrapperConfig.java | `100.00% <100.00%> (ø)` | |
| ...e/reader/io/jdbc/iowrapper/config/TableConfig.java | `100.00% <ø> (ø)` | |
| ...plitter/columnboundary/ColumnForBoundaryQuery.java | `100.00% <100.00%> (ø)` | |
| ...ColumnForBoundaryQueryPreparedStatementSetter.java | `100.00% <100.00%> (ø)` | |
| ...niformsplitter/range/BoundaryExtractorFactory.java | `100.00% <100.00%> (ø)` | |
| ...uniformsplitter/range/BoundarySplitterFactory.java | `100.00% <100.00%> (ø)` | |
| ...rmsplitter/range/RangePreparedStatementSetter.java | `100.00% <100.00%> (ø)` | |
| ...formsplitter/transforms/InitialSplitRangeDoFn.java | `100.00% <100.00%> (ø)` | |
| ...ormsplitter/transforms/RangeBoundaryTransform.java | `100.00% <100.00%> (ø)` | |
| ... and 14 more | | |

... and 480 files with indirect coverage changes