apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

BigQueryIO - Can't use DynamicDestination with CREATE_IF_NEEDED for unbounded PCollection and FILE_LOADS #18706

Open kennknowles opened 2 years ago

kennknowles commented 2 years ago

My workflow: Kafka -> Dataflow streaming -> BigQuery

Since low latency isn't important in my case, I use FILE_LOADS to reduce costs. I'm using BigQueryIO.Write with DynamicDestinations, where the destination is a table with the current hour as a suffix.

This BigQueryIO.Write is configured like this:
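For illustration, here is a minimal sketch of how an hour-suffixed table name of this kind might be derived from an element timestamp. The class and method names are hypothetical, and both the `yyyyMMdd_HH` pattern and the use of UTC are assumptions inferred from the table ID `mytable_20180302_16` that appears in the error further down; the actual pipeline may compute the suffix differently.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Beam: builds a table name with an hourly suffix.
public class HourlyTableName {
    // Assumed pattern and time zone, inferred from "mytable_20180302_16".
    private static final DateTimeFormatter HOUR_SUFFIX =
        DateTimeFormatter.ofPattern("yyyyMMdd_HH").withZone(ZoneOffset.UTC);

    // Appends the formatted hour of the given timestamp to the base table name.
    static String forInstant(String baseTable, Instant timestamp) {
        return baseTable + "_" + HOUR_SUFFIX.format(timestamp);
    }

    public static void main(String[] args) {
        // Prints the table name for an element timestamped inside the 16:00 UTC hour.
        System.out.println(forInstant("mytable", Instant.parse("2018-03-02T16:30:00Z")));
    }
}
```

In a pipeline, a function like this would typically be called from the destination logic so that each new hour maps to a table that does not exist yet, which is the situation where CREATE_IF_NEEDED matters.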


.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(triggeringFrequency)
.withNumFileShards(100)

The first table is successfully created and written to, but subsequent tables are never created, and I get exceptions like this:


(99e5cd8c66414e7a): java.lang.RuntimeException: Failed to create load job with id prefix 5047f71312a94bf3a42ee5d67feede75_5295fbf25e1a7534f85e25dcaa9f4986_00001_00023,
reached max retries: 3, last failed load job: {
  "configuration" : {
    "load" : {
      "createDisposition" : "CREATE_NEVER",
      "destinationTable" : {
        "datasetId" : "dev_mydataset",
        "projectId" : "myproject-id",
        "tableId" : "mytable_20180302_16"
      },

The CreateDisposition used in the load job is CREATE_NEVER, not CREATE_IF_NEEDED as specified.

Imported from Jira BEAM-3772. Original Jira may contain additional context. Reported by: benjben.

ragyabraham commented 1 year ago

@kennknowles any updates on this issue? It is causing problems for us in production.

matthieucham commented 3 weeks ago

For anyone interested, a workaround is explained here: https://dev.to/stack-labs/tricky-dataflow-ep-1-auto-create-bigquery-tables-in-pipelines-n2k