apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Dataflow Runner V2 does not support BigQueryIO STORAGE_WRITE_API when configured withAutoSharding #31373

Open damnMeddlingKid opened 4 months ago

damnMeddlingKid commented 4 months ago

What happened?

We are attempting to use the STORAGE_WRITE_API with exactly-once guarantees in our pipelines running on Runner V2. Our configuration uses dynamic destinations and auto sharding, as detailed below:

BigQueryIO
          .write[TableRowWithTableId]
          .to(new DynamicDestinationImpl())
          .optimizedWrites()
          .withFormatFunction(_.tableRow)
          .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
          .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
          .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
          .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
          .withExtendedErrorInfo()
          .withAutoSharding()
          .withTriggeringFrequency(Duration.standardMinutes(15))

Issue Encountered

When we run our pipeline on Runner V2 with the above BigQueryIO configuration, we get the following error:

Error translating pipeline. Runner V2 doesn't support the following SDK features: [Use STORAGE_WRITE_API].

The pipeline executes successfully when we modify the configuration to use a static number of write streams (withNumStorageWriteApiStreams(40)) instead of auto sharding.
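
For reference, the working variant looks roughly like the sketch below: the same configuration with `.withAutoSharding()` swapped for a fixed stream count. `TableRowWithTableId` and `DynamicDestinationImpl` are our own types and are not shown, and 40 is simply the value we tried, not a recommendation.

```scala
import org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO, InsertRetryPolicy}
import org.joda.time.Duration

// Same write as above, except that auto sharding is replaced by a static
// number of Storage Write API streams.
// TableRowWithTableId and DynamicDestinationImpl are pipeline-specific types.
val write = BigQueryIO
  .write[TableRowWithTableId]
  .to(new DynamicDestinationImpl())
  .optimizedWrites()
  .withFormatFunction(_.tableRow)
  .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
  .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
  .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
  .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
  .withExtendedErrorInfo()
  .withNumStorageWriteApiStreams(40) // replaces .withAutoSharding()
  .withTriggeringFrequency(Duration.standardMinutes(15))
```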

While looking for references on this issue I found https://partnerissuetracker.corp.google.com/issues/271105510 which claims that auto sharding should work on Runner V2.

Questions

  1. Is it safe to use a static number of write streams as a workaround for using the STORAGE_WRITE_API on Runner V2?
  2. What is the current state of STORAGE_WRITE_API support on Runner V2? I'm struggling to find an issue or documentation on this.
  3. Is it possible to support auto sharding for BigQueryIO?

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

liferoad commented 4 months ago

https://beam.apache.org/documentation/io/built-in/google-bigquery/#writing-to-a-table

Internal Runner V2 work is under way and should resolve this issue soon.

For now, you can disable Runner V2 if possible.
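
A minimal sketch of one way to do that from code, assuming the job is allowed to opt out of Runner V2 through the Dataflow `disable_runner_v2` experiment (some jobs may not be able to). The object name `DisableRunnerV2` is only illustrative.

```scala
import org.apache.beam.sdk.options.{ExperimentalOptions, PipelineOptionsFactory}

object DisableRunnerV2 {
  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()

    // Same effect as passing --experiments=disable_runner_v2 on the command line.
    ExperimentalOptions.addExperiment(
      options.as(classOf[ExperimentalOptions]),
      "disable_runner_v2"
    )

    // ... build and run the pipeline with `options` ...
  }
}
```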

cc @scwhittle

scwhittle commented 4 months ago

There is a public GCP issue tracking this as well.

Beyond using v1, another mitigation allowing the use of .withMethod(STORAGE_WRITE_API) with the v2 runner is to both:

You can refer to this blog post for some guidance on setting the number of streams if you are disabling autosharding.