[Bug]: BigQuery Storage Write API implementation does not support table partitioning

helensilva14 commented 2 years ago

What happened?

Hello! Me and my team found a scenario where we needed to check if Beam can handle the dynamically creation of BQ tables with partitions using the new API.

Like the Spark BigQuery Connector, the Beam connector supports different ways of writing to BigQuery, currently having these write methods available:

FILE_LOADS
STORAGE_API_AT_LEAST_ONCE
STORAGE_WRITE_API
STREAMING_INSERTS

The second and third ones make use of BigQuery Storage Write API (if using the Spark BQ connector, this would be the direct mode)

Regarding table partitions (default BQ time partitioning):

According to the documentation, ALL 4 methods have the option withTimePartitioning, which allows only the default BQ partitions (DAY, MONTH, YEAR). We tested it with the second and third write methods and we got this error:

SEVERE: 2022-06-15T16:03:40.193Z: java.lang.NoSuchMethodError: 'long com.google.cloud.bigquery.storage.v1.StreamWriter.getInflightWaitSeconds()'
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl$1.getInflightWaitSeconds(BigQueryServicesImpl.java:1291)
    at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWriteUnshardedRecords$WriteRecordsDoFn$DestinationState.lambda$flush$1(StorageApiWriteUnshardedRecords.java:342)
    at org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.run(RetryManager.java:131)
    at org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:247)
    at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWriteUnshardedRecords$WriteRecordsDoFn.flushAll(StorageApiWriteUnshardedRecords.java:435)
    at org.apache.beam.sdk.io.gcp.bigquery.StorageApiWriteUnshardedRecords$WriteRecordsDoFn.finishBundle(StorageApiWriteUnshardedRecords.java:495)

Regarding table partitions (custom columns):

We know that BigQuery has the option for specifying a column as time or range partition, and we are interested in that
According to the documentation, unlike the Spark BigQuery connector (indirect mode), the Beam connector does not support specifying custom columns as partitions when writing to BQ
The Spark BigQuery connector (direct mode) claims that it supports partitions if the table is already created and defined with the specific column as a partition.
We tried this approach with the Beam connector, and we got the same error presented above

Conclusion: it seems that Apache Beam connector implementation which uses the BigQuery Storage Write API has problems and limitations regarding table partitions.

The testing pipeline is provided as a Gist here. We hope this issue can be addressed and would be glad to help/validate. Thanks!

Issue Priority

Priority: 1

Issue Component

Component: io-java-gcp

pabloem commented 2 years ago

@reuvenlax I'm assigning this to you. Please do triage / acknowledge whether this makes sense.

reuvenlax commented 2 years ago

This looks like you might have incompatible JARs linked into your binary.

helensilva14 commented 2 years ago

Hi! Coming back to this issue later than expected. Do you mean if I run the pipeline with latest version now (2.40 instead of 2.39) I could get a different result? I'll try and reach out with the results.

reuvenlax commented 2 years ago

It means that your build is probably pulling in a different version of the BigQuery library than the one used by Beam.

On Tue, Aug 16, 2022 at 10:35 AM Helen Cristina @.***> wrote:

Hi! Coming back to this issue later than expected. Do you mean if I run the pipeline with latest version now (2.40 instead of 2.39) I could get a different result? I'll try and reach out with the results.

— Reply to this email directly, view it on GitHub https://github.com/apache/beam/issues/21893#issuecomment-1216945446, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAYJVLK3KZBSKLEPHOCKBLVZPGNVANCNFSM5Y337EKA . You are receiving this because you were mentioned.Message ID: @.***>

pabloem commented 2 years ago

The advice in these cases is to try the following:

mvn dependency:tree | grep bigquery

There you can find the different bigquery-related libraries that you have in your project. Once you've found those, you may find two examples of the same library with different versions - and you can fix this with dependencyManagement - usually by choosing the later version.

You can also use -Dincludes= to filter more intelligently than with grep: https://maven.apache.org/plugins/maven-dependency-plugin/examples/filtering-the-dependency-tree.html

apache / beam