apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: BigQuery Storage Write API implementation does not support table partitioning #21893

Open helensilva14 opened 2 years ago

helensilva14 commented 2 years ago

What happened?

Hello! My team and I hit a scenario where we needed to check whether Beam can handle the dynamic creation of partitioned BQ tables using the new API.

Like the Spark BigQuery Connector, the Beam connector supports different ways of writing to BigQuery, currently having these write methods available:

  1. FILE_LOADS
  2. STORAGE_API_AT_LEAST_ONCE
  3. STORAGE_WRITE_API
  4. STREAMING_INSERTS

The second and third ones use the BigQuery Storage Write API (in the Spark BQ connector, this would be the direct mode).

Regarding table partitions (default BQ time partitioning):

SEVERE: 2022-06-15T16:03:40.193Z: java.lang.NoSuchMethodError: 'long com.google.cloud.bigquery.storage.v1.StreamWriter.getInflightWaitSeconds()'
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl$1.getInflightWaitSeconds(BigQueryServicesImpl.java:1291)
    at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWriteUnshardedRecords$WriteRecordsDoFn$DestinationState.lambda$flush$1(StorageApiWriteUnshardedRecords.java:342)
    at org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.run(RetryManager.java:131)
    at org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:247)
    at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWriteUnshardedRecords$WriteRecordsDoFn.flushAll(StorageApiWriteUnshardedRecords.java:435)
    at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWriteUnshardedRecords$WriteRecordsDoFn.finishBundle(StorageApiWriteUnshardedRecords.java:495)

Regarding table partitions (custom columns):

Conclusion: it seems that the Apache Beam connector implementation that uses the BigQuery Storage Write API has problems and limitations regarding table partitions.

The testing pipeline is provided as a Gist here. We hope this issue can be addressed and would be glad to help/validate. Thanks!

Issue Priority

Priority: 1

Issue Component

Component: io-java-gcp

pabloem commented 2 years ago

@reuvenlax I'm assigning this to you. Please do triage / acknowledge whether this makes sense.

reuvenlax commented 2 years ago

This looks like you might have incompatible JARs linked into your binary.

helensilva14 commented 2 years ago

Hi! Coming back to this issue later than expected. Do you mean if I run the pipeline with latest version now (2.40 instead of 2.39) I could get a different result? I'll try and reach out with the results.

reuvenlax commented 2 years ago

It means that your build is probably pulling in a different version of the BigQuery library than the one used by Beam.
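
One way to confirm a mismatch like this is to ask the JVM where the class in the stack trace was actually loaded from, and whether it exposes the missing method. The following is a minimal sketch, not part of Beam: `ClasspathCheck` and its helper methods are hypothetical names, and the default class and method names are simply the ones from the NoSuchMethodError above. It only tells you something useful when run with the same classpath as the pipeline.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

public class ClasspathCheck {

    // Returns where a class was loaded from (usually a JAR URL).
    // JDK classes have no code source, so a fixed marker is returned for them.
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "<bootstrap or unknown>";
        }
        return src.getLocation().toString();
    }

    // True if the class exposes a public no-argument method with the given name.
    static boolean hasPublicNoArgMethod(Class<?> cls, String name) {
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Defaults are taken from the stack trace in this issue.
        String className = args.length > 0 ? args[0]
                : "com.google.cloud.bigquery.storage.v1.StreamWriter";
        String methodName = args.length > 1 ? args[1] : "getInflightWaitSeconds";
        try {
            Class<?> cls = Class.forName(className);
            System.out.println("Loaded from: " + locationOf(cls));
            System.out.println("Has " + methodName + "(): "
                    + hasPublicNoArgMethod(cls, methodName));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }
}
```

If the reported JAR is an older BigQuery Storage client than the one Beam was built against, the method check would be expected to print false, which matches the error above.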

pabloem commented 2 years ago

The advice in these cases is to try the following:

mvn dependency:tree | grep bigquery

There you can find the different BigQuery-related libraries that you have in your project. Among them you may find two copies of the same library with different versions. You can fix this with dependencyManagement, usually by choosing the later version.

You can also use -Dincludes= to filter more intelligently than with grep: https://maven.apache.org/plugins/maven-dependency-plugin/examples/filtering-the-dependency-tree.html
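
For reference, a dependencyManagement pin in pom.xml might look roughly like the fragment below. This is a sketch under assumptions: the coordinates `com.google.cloud:google-cloud-bigquerystorage` are inferred from the `com.google.cloud.bigquery.storage.v1` package in the stack trace, and `${bigquerystorage.version}` is a hypothetical placeholder property to be set to whichever version the dependency tree shows Beam expecting.

```xml
<dependencyManagement>
  <dependencies>
    <!-- Forces a single version of the BigQuery Storage client across
         all transitive dependencies. The version property is a placeholder. -->
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-bigquerystorage</artifactId>
      <version>${bigquerystorage.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

After adding the pin, rerunning the dependency tree command should show only one version of that artifact.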