GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

String splitter #1743

Closed VardhanThigle closed 1 month ago

VardhanThigle commented 1 month ago

Implements the logic for splitting strings.

Limitation

This PR fully supports characters with code points of up to 3 bytes. If a source dataset has characters whose code points are longer than 3 bytes, there could be an exception due to characters that can't be mapped. A follow-up to fix this limitation is in the works.
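As a rough aid for judging whether a dataset is affected, here is a minimal Java sketch that flags strings containing characters outside the 3-byte range. It assumes that "3 byte" refers to the UTF-8 encoded width of a code point, so that affected characters are those above U+FFFF. The class `Utf8WidthSketch` and its methods are hypothetical and not part of this PR.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of this PR: detects code points wider than 3 UTF-8 bytes. */
class Utf8WidthSketch {

  /** Returns the number of bytes a single code point occupies when encoded as UTF-8. */
  static int utf8Length(int codePoint) {
    return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
  }

  /**
   * Returns true if the string has at least one supplementary code point (above U+FFFF).
   * Assumption: these are exactly the characters that need 4 bytes in UTF-8.
   */
  static boolean hasFourByteCodePoint(String s) {
    return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
  }

  public static void main(String[] args) {
    System.out.println(hasFourByteCodePoint("caf\u00E9"));      // 2-byte character only
    System.out.println(hasFourByteCodePoint("\u65E5\u672C"));   // 3-byte characters only
    System.out.println(hasFourByteCodePoint("a\uD83D\uDE00"));  // contains U+1F600
  }
}
```

A check like this could be run over a sample of the indexed string column before a migration; it says nothing about how the PR itself detects or reports unmappable characters.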

Overview

To read a table that has a String column in its index using parallel range queries, we need to split the space of strings from min to max into chunks. Unlike integers, strings can't be directly [...]. The logic for mapping the string has to take care of various database nuances.
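To make the idea of splitting a string range concrete, here is a minimal Java sketch that maps strings to integers, takes the arithmetic midpoint, and maps it back to a string. It is only an illustration under strong assumptions: the alphabet is lowercase ASCII `a` to `z`, ordering is plain ordinal order, and no collation rules are applied. The class `StringRangeSketch` and its methods are hypothetical and do not reflect the implementation in this PR.

```java
import java.math.BigInteger;

/** Hypothetical sketch, not the PR's implementation: midpoint of a lowercase ASCII string range. */
class StringRangeSketch {
  // Assumed alphabet: a padding digit (0) followed by 'a'..'z' (digits 1..26).
  private static final BigInteger BASE = BigInteger.valueOf(27);

  /** Maps a string to an integer in base 27, right-padding with digit 0 up to {@code width}. */
  static BigInteger toInteger(String s, int width) {
    if (s.length() > width) {
      throw new IllegalArgumentException("String is longer than width: " + s);
    }
    BigInteger value = BigInteger.ZERO;
    for (int i = 0; i < width; i++) {
      int digit = 0; // padding digit for positions past the end of the string
      if (i < s.length()) {
        char c = s.charAt(i);
        if (c < 'a' || c > 'z') {
          throw new IllegalArgumentException("Unsupported character: " + c);
        }
        digit = c - 'a' + 1;
      }
      value = value.multiply(BASE).add(BigInteger.valueOf(digit));
    }
    return value;
  }

  /** Maps an integer back to a string, truncating at the first padding digit. */
  static String fromInteger(BigInteger value, int width) {
    int[] digits = new int[width];
    for (int i = width - 1; i >= 0; i--) {
      BigInteger[] quotientAndRemainder = value.divideAndRemainder(BASE);
      digits[i] = quotientAndRemainder[1].intValue();
      value = quotientAndRemainder[0];
    }
    StringBuilder result = new StringBuilder();
    for (int digit : digits) {
      if (digit == 0) {
        break; // truncation rounds the split point down, never below the range minimum
      }
      result.append((char) ('a' + digit - 1));
    }
    return result.toString();
  }

  /** Returns a split point between min and max by averaging their integer images. */
  static String midpoint(String min, String max, int width) {
    BigInteger lo = toInteger(min, width);
    BigInteger hi = toInteger(max, width);
    return fromInteger(lo.add(hi).shiftRight(1), width);
  }

  public static void main(String[] args) {
    System.out.println(midpoint("a", "c", 1)); // prints "b"
    System.out.println(midpoint("a", "b", 2)); // prints "am"
  }
}
```

Applying `midpoint` recursively to each half would yield more chunks. A real database orders strings by its collation rather than by ordinal value, which is why a plain mapping like this one is not sufficient on its own.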

TODOS (for current PR)

  1. End to End migration manual test.

TODOS (for follow-up PR)

  1. MySQL 5.7-compatible collation order and index-discovery queries.
  2. Making the side input DB-level (instead of the current table-level).

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 95.43269% with 19 lines in your changes missing coverage. Please review.

Project coverage is 49.46%. Comparing base (983c151) to head (2cb1426).

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main    #1743      +/-   ##
============================================
+ Coverage     42.15%   49.46%   +7.30%
+ Complexity     3278     1092    -2186
============================================
  Files           808      337     -471
  Lines         47293    18204   -29089
  Branches       5053     1844    -3209
============================================
- Hits          19938     9004   -10934
+ Misses        25710     8593   -17117
+ Partials       1645      607    -1038
```

| [Components](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743/components?src=pr&el=components&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | Coverage Δ | |
|---|---|---|
| [spanner-templates](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | `64.64% <95.43%> (+1.01%)` | :arrow_up: |
| [spanner-import-export](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | `∅ <ø> (∅)` | |
| [spanner-live-forward-migration](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | `75.00% <ø> (ø)` | |
| [spanner-live-reverse-replication](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | `51.96% <ø> (ø)` | |
| [spanner-bulk-migration](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | `83.45% <95.43%> (+1.01%)` | :arrow_up: |

| [Files](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | Coverage Δ | |
|---|---|---|
| [...reader/io/jdbc/uniformsplitter/range/Boundary.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Frange%2FBoundary.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci9yYW5nZS9Cb3VuZGFyeS5qYXZh) | `98.38% <ø> (ø)` | |
| [...uniformsplitter/range/BoundarySplitterFactory.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Frange%2FBoundarySplitterFactory.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci9yYW5nZS9Cb3VuZGFyeVNwbGl0dGVyRmFjdG9yeS5qYXZh) | `100.00% <100.00%> (ø)` | |
| [...io/jdbc/uniformsplitter/range/PartitionColumn.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Frange%2FPartitionColumn.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci9yYW5nZS9QYXJ0aXRpb25Db2x1bW4uamF2YQ==) | `92.85% <ø> (ø)` | |
| [...iformsplitter/stringmapper/CollationReference.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Fstringmapper%2FCollationReference.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci9zdHJpbmdtYXBwZXIvQ29sbGF0aW9uUmVmZXJlbmNlLmphdmE=) | `100.00% <100.00%> (ø)` | |
| [...niformsplitter/transforms/CollationMapperDoFn.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Ftransforms%2FCollationMapperDoFn.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci90cmFuc2Zvcm1zL0NvbGxhdGlvbk1hcHBlckRvRm4uamF2YQ==) | `100.00% <100.00%> (ø)` | |
| [...formsplitter/transforms/InitialSplitRangeDoFn.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Ftransforms%2FInitialSplitRangeDoFn.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci90cmFuc2Zvcm1zL0luaXRpYWxTcGxpdFJhbmdlRG9Gbi5qYXZh) | `100.00% <100.00%> (ø)` | |
| [...ormsplitter/transforms/RangeBoundaryTransform.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Ftransforms%2FRangeBoundaryTransform.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci90cmFuc2Zvcm1zL1JhbmdlQm91bmRhcnlUcmFuc2Zvcm0uamF2YQ==) | `100.00% <100.00%> (ø)` | |
| [...niformsplitter/transforms/RangeCountTransform.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Funiformsplitter%2Ftransforms%2FRangeCountTransform.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL3VuaWZvcm1zcGxpdHRlci90cmFuc2Zvcm1zL1JhbmdlQ291bnRUcmFuc2Zvcm0uamF2YQ==) | `100.00% <100.00%> (ø)` | |
| [...source/reader/io/schema/SourceColumnIndexInfo.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fschema%2FSourceColumnIndexInfo.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9zY2hlbWEvU291cmNlQ29sdW1uSW5kZXhJbmZvLmphdmE=) | `60.00% <100.00%> (+2.10%)` | :arrow_up: |
| [...jdbc/dialectadapter/mysql/MysqlDialectAdapter.java](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree&filepath=v2%2Fsourcedb-to-spanner%2Fsrc%2Fmain%2Fjava%2Fcom%2Fgoogle%2Fcloud%2Fteleport%2Fv2%2Fsource%2Freader%2Fio%2Fjdbc%2Fdialectadapter%2Fmysql%2FMysqlDialectAdapter.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform#diff-djIvc291cmNlZGItdG8tc3Bhbm5lci9zcmMvbWFpbi9qYXZhL2NvbS9nb29nbGUvY2xvdWQvdGVsZXBvcnQvdjIvc291cmNlL3JlYWRlci9pby9qZGJjL2RpYWxlY3RhZGFwdGVyL215c3FsL015c3FsRGlhbGVjdEFkYXB0ZXIuamF2YQ==) | `99.62% <97.95%> (-0.38%)` | :arrow_down: |
| ... and [8 more](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform) | |

... and [494 files with indirect coverage changes](https://app.codecov.io/gh/GoogleCloudPlatform/DataflowTemplates/pull/1743/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GoogleCloudPlatform)