GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

[Bug]: MongoDB to BigQuery breaking schema change from STRING to JSON #1834

Open pilis opened 2 months ago

pilis commented 2 months ago

Related Template(s)

mongodb_to_bigquery

Template Version

2024-08-20-00_rc00

What happened?

I have been using this template for a long time. I have a table in BigQuery where the source_data column is of type STRING (the table is appended to by a daily job execution). Pull Request https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/1796 introduced a change that alters the schema of that column from STRING to JSON. This change is not backwards compatible: running the job now fails with the error below.
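For context, the incompatibility comes down to how the load job represents the column: a STRING column carries the Mongo document as a JSON-encoded string, while a JSON column carries it as a native JSON value. A minimal sketch (plain Python; the document and field names are illustrative, not from the template source) of what the NEWLINE_DELIMITED_JSON rows look like in each case:

```python
import json

# A sample MongoDB document as it might arrive in the pipeline
# (hypothetical data; field names are illustrative only).
doc = {"_id": "abc123", "user": {"name": "alice", "age": 30}}

# Pre-PR-#1796 behavior: source_data is a STRING column, so the
# document is serialized and the NDJSON row carries a JSON *string*.
row_string_schema = {"source_data": json.dumps(doc)}

# Post-PR-#1796 behavior: source_data is a JSON column, so the
# NDJSON row carries the document as a nested JSON *value*.
row_json_schema = {"source_data": doc}

print(json.dumps(row_string_schema))
print(json.dumps(row_json_schema))

# Appending rows with a JSON-typed source_data to an existing table
# whose source_data column is STRING is what makes BigQuery reject
# the load job with "Field source_data has changed type from STRING
# to JSON".
```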

Relevant log output

Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000, reached max retries: 3, last failed job: {
  "configuration" : {
    "jobType" : "LOAD",
    "labels" : {
      "beam_job_id" : "2024-09-01_19_07_45-5426785199031709769"
    },
    "load" : {
      "createDisposition" : "CREATE_IF_NEEDED",
      "destinationTable" : {
        "datasetId" : "MY_DATASET_ID",
        "projectId" : "MY_PROJECT_ID",
        "tableId" : "MY_TABLE_ID"
      },
      "ignoreUnknownValues" : false,
      "sourceFormat" : "NEWLINE_DELIMITED_JSON",
      "useAvroLogicalTypes" : false,
      "writeDisposition" : "WRITE_APPEND"
    }
  },
  "etag" : "R6mh8Mzo3e8NvZKP0av9xw==",
  "id" : "MY_PROJECT_ID:MY_REGION_ID.beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
  "jobCreationReason" : {
    "code" : "REQUESTED"
  },
  "jobReference" : {
    "jobId" : "beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
    "location" : "MY_REGION_ID",
    "projectId" : "MY_PROJECT_ID"
  },
  "kind" : "bigquery#job",
  "principal_subject" : "MY_SERVICE_ACCOUNT",
  "selfLink" : "https://bigquery.googleapis.com/bigquery/v2/projects/MY_PROJECT_ID/jobs/beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6?location=MY_REGION_ID",
  "statistics" : {
    "creationTime" : "1725243135832",
    "endTime" : "1725243135858",
    "startTime" : "1725243135858"
  },
  "status" : {
    "errorResult" : {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    },
    "errors" : [ {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    } ],
    "state" : "DONE"
  },
  "user_email" : "MY_SERVICE_ACCOUNT_ID"
}.
    org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
    org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
    org.apache.beam.fn.harness.FnApiDoFnRunner.finishBundle(FnApiDoFnRunner.java:1776)
    org.apache.beam.fn.harness.data.PTransformFunctionRegistry.lambda$register$0(PTransformFunctionRegistry.java:116)
    org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:560)
    org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150)
    org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163)
    java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000, reached max retries: 3, last failed job: {
  "configuration" : {
    "jobType" : "LOAD",
    "labels" : {
      "beam_job_id" : "2024-09-01_19_07_45-5426785199031709769"
    },
    "load" : {
      "createDisposition" : "CREATE_IF_NEEDED",
      "destinationTable" : {
        "datasetId" : "MY_DATASET_ID",
        "projectId" : "MY_PROJECT_ID",
        "tableId" : "MY_TABLE_ID"
      },
      "ignoreUnknownValues" : false,
      "sourceFormat" : "NEWLINE_DELIMITED_JSON",
      "useAvroLogicalTypes" : false,
      "writeDisposition" : "WRITE_APPEND"
    }
  },
  "etag" : "R6mh8Mzo3e8NvZKP0av9xw==",
  "id" : "MY_PROJECT_ID:MY_REGION_ID.beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
  "jobCreationReason" : {
    "code" : "REQUESTED"
  },
  "jobReference" : {
    "jobId" : "beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
    "location" : "MY_REGION_ID",
    "projectId" : "MY_PROJECT_ID"
  },
  "kind" : "bigquery#job",
  "principal_subject" : "MY_SERVICE_ACCOUNT",
  "selfLink" : "https://bigquery.googleapis.com/bigquery/v2/projects/MY_PROJECT_ID/jobs/beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6?location=MY_REGION_ID",
  "statistics" : {
    "creationTime" : "1725243135832",
    "endTime" : "1725243135858",
    "startTime" : "1725243135858"
  },
  "status" : {
    "errorResult" : {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    },
    "errors" : [ {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    } ],
    "state" : "DONE"
  },
  "user_email" : "MY_SERVICE_ACCOUNT_ID"
}.
    org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:208)
    org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:161)
    org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:389)
pilis commented 2 months ago

I am unable to run the previous version of the job, because only the latest version of the template is available. As a workaround, I will stage the previous version of the template myself and host it on my end.

Link to previous working image: https://console.cloud.google.com/gcr/images/dataflow-templates/global/2024-08-13-00_rc00/mongodb-to-bigquery

Link to template: https://console.cloud.google.com/storage/browser/_details/dataflow-templates-europe-west3/latest/flex/MongoDB_to_BigQuery;tab=version_history

I am unable to download the previous version of the template file (version history is not enabled on the bucket).
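One way to carry out this workaround is to take a copy of the flex-template spec file, rewrite its container-image reference to the last working build (the 2024-08-13-00_rc00 image linked above), and host the pinned spec in a bucket you control. This is a sketch, not an official procedure; the spec layout shown here is a simplified assumption, and the exact field set of real spec files may differ:

```python
import json

# Hypothetical, simplified flex-template spec. Real spec files
# contain additional fields (metadata, SDK info, parameters);
# only the "image" reference matters for pinning.
spec = {
    "image": "gcr.io/dataflow-templates/latest/mongodb-to-bigquery",
    "sdkInfo": {"language": "JAVA"},
}

# Pin the image to the last release that still wrote source_data
# as STRING (the 2024-08-13-00_rc00 build referenced in this thread).
PINNED = "gcr.io/dataflow-templates/2024-08-13-00_rc00/mongodb-to-bigquery"
spec["image"] = PINNED

# Write the pinned spec locally; upload this file to your own
# bucket and pass its gs:// path when launching the flex template.
with open("mongodb_to_bigquery_pinned.json", "w") as f:
    json.dump(spec, f, indent=2)

print(spec["image"])
```

The pinned spec then keeps daily runs on the old schema behavior until the template itself is fixed, at the cost of maintaining the copy yourself.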


turkerakyuz commented 2 months ago

Changing the data type of the source_data column means that every data pipeline appending to an existing BigQuery table via this template has to be modified. This breaks backward compatibility.

kabir-aviva commented 2 months ago

Same issue here. Is there any way this can be prioritized?

ahmedabu98 commented 2 months ago

CC @Polber: is it necessary to fix the source_data column to the JSON type?

ggprod commented 2 months ago

@Polber when will the merged fix be deployed for users of the template?

pcortez commented 2 months ago

Hi! Due to urgency, I had to change the source_data schema to JSON. Now that the change has been reverted in production, is there a way to keep using JSON instead of STRING?

Perhaps a new parameter could be added to the template so users can choose how source_data is written.
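A hypothetical sketch of what such an option could look like on the write path. The parameter name (use_json_type) and the helper are inventions for illustration, not part of the actual template:

```python
import json

def format_source_data(doc: dict, use_json_type: bool):
    """Hypothetical helper mirroring a would-be template option.

    use_json_type=False -> serialize to a string (STRING column,
                           pre-PR-#1796 behavior, backward compatible)
    use_json_type=True  -> pass the document through (JSON column,
                           the new behavior)
    """
    return doc if use_json_type else json.dumps(doc)

doc = {"_id": "abc123", "nested": {"k": 1}}
print(format_source_data(doc, use_json_type=False))  # JSON string
print(format_source_data(doc, use_json_type=True))   # JSON value
```

With a default of False, existing tables would keep loading, while new users could opt into the native JSON column.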

ggprod commented 1 month ago

@Polber when will the merged fix be deployed for users of the template?

Never mind, I see it was deployed to the GCS bucket (gs://dataflow-templates-us-central1/latest/flex/MongoDB_to_BigQuery) yesterday at 10 AM. Thanks!