GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

[Bug]: Dataflow - MongoDB-to-BigQuery batch mode failing with filter on data #1328

Open robbycarter opened 6 months ago

robbycarter commented 6 months ago

Related Template(s)

MongoDB-to-BigQuery

Template Version

v2

What happened?

I have a UDF that checks whether a field is true. If it is true, the function returns null so that the document is skipped and not saved into BigQuery.

I have also tried return undefined and return "", and I keep getting the same error:

com.google.cloud.teleport.v2.common.UncaughtExceptionLogger - The template launch failed.
java.lang.IllegalArgumentException: schema can not be null

Below is a code snippet:

function deliveries_transform(input_doc) {
  var doc = JSON.parse(input_doc);

  // Filter: skip documents that have a parent.
  if (doc.has_parent) {
    return null;
  }

  // Return the document after stringifying it.
  return JSON.stringify(doc);
}

I was following the filtering example in the documentation: https://cloud.google.com/dataflow/docs/guides/templates/create-template-udf#filter_events

The job was created using the Google Cloud console, not via the API or SDK.
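From the stack trace below, the failure happens when MongoDbToBigQuery.run calls BigQueryIO.Write.withSchema with a null schema at launch time. My reading (not confirmed) is that the template derives the output table schema by running the UDF over a sampled document before the pipeline starts, so if that sampled document is one the UDF filters out, the schema comes back null. As an untested workaround sketch, the UDF could keep every document and just tag the ones to drop (skip_import below is a made-up field name), leaving the actual filtering to a later step in BigQuery:

function deliveries_transform(input_doc) {
  var doc = JSON.parse(input_doc);

  // Tag instead of dropping: returning null here seems to leave the
  // launcher without a schema, so keep the document and mark it.
  doc.skip_import = Boolean(doc.has_parent);

  return JSON.stringify(doc);
}

The tagged rows would still have to be deleted or filtered out in BigQuery afterwards, which defeats part of the purpose, but it at least lets the launch succeed.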

Relevant log output

[
  {
    "insertId": "",
    "jsonPayload": {
      "line": "exec.go:66",
      "message": "com.google.cloud.teleport.v2.common.UncaughtExceptionLogger - The template launch failed.\njava.lang.IllegalArgumentException: schema can not be null\n\tat org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)\n\tat org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.withSchema(BigQueryIO.java:2679)\n\tat com.google.cloud.teleport.v2.mongodb.templates.MongoDbToBigQuery.run(MongoDbToBigQuery.java:154)\n\tat com.google.cloud.teleport.v2.mongodb.templates.MongoDbToBigQuery.main(MongoDbToBigQuery.java:96)\n"
    },
    "resource": {
      "type": "dataflow_step",
      "labels": {
        "region": "",
        "project_id": "",
        "step_id": "",
        "job_name": "mongodb-to-bigquery-batch",
        "job_id": ""
      }
    },
    "timestamp": "2024-02-12T21:45:00.037010Z",
    "severity": "ERROR",
    "labels": {
      "compute.googleapis.com/resource_name": "",
      "dataflow.googleapis.com/region": "us-east4",
      "dataflow.googleapis.com/job_id": "",
      "compute.googleapis.com/resource_id": "",
      "compute.googleapis.com/resource_type": "",
      "dataflow.googleapis.com/job_name": "mongodb-to-bigquery-batch"
    },
    "logName": "",
    "receiveTimestamp": "2024-02-12T21:45:02.855403339Z",
    "errorGroups": [
      {
        "id": "CPXppsbT8JP4nQE"
      }
    ]
  },
  {
    "insertId": "",
    "jsonPayload": {
      "message": "Error: Template launch failed: exit status 1",
      "line": "launch.go:80"
    },
    "resource": {
      "type": "dataflow_step",
      "labels": {
        "job_name": "mongodb-to-bigquery-batch",
        "job_id": "",
        "step_id": "",
        "project_id": "",
        "region": ""
      }
    },
    "timestamp": "",
    "severity": "ERROR",
    "labels": {
      "dataflow.googleapis.com/region": "",
      "dataflow.googleapis.com/job_id": "",
      "compute.googleapis.com/resource_id": "",
      "compute.googleapis.com/resource_type": "",
      "compute.googleapis.com/resource_name": "",
      "dataflow.googleapis.com/job_name": "mongodb-to-bigquery-batch"
    },
    "logName": "",
    "receiveTimestamp": "2024-02-12T21:45:02.855403339Z"
  },
  {
    "textPayload": "Error occurred in the launcher container: Template launch failed. See console logs.",
    "insertId": "xl5y9bd22ed",
    "resource": {
      "type": "dataflow_step",
      "labels": {
        "project_id": "",
        "job_id": "2024-02-12_13_43_46-15601135711795228441",
        "job_name": "mongodb-to-bigquery-batch",
        "step_id": "",
        "region": ""
      }
    },
    "timestamp": "2024-02-12T21:47:43.432514787Z",
    "severity": "ERROR",
    "labels": {
      "dataflow.googleapis.com/job_id": "2024-02-12_13_43_46-15601135711795228441",
      "dataflow.googleapis.com/region": ",
      "dataflow.googleapis.com/log_type": "",
      "dataflow.googleapis.com/job_name": "mongodb-to-bigquery-batch"
    },
    "logName": "",
    "receiveTimestamp": "2024-02-12T21:47:43.962727013Z"
  }
]
britz89 commented 5 months ago

Hi,

I'm encountering the same issue. If I use a "return null" statement to skip a document row, I get the "schema can not be null" error. Did anyone manage to resolve this? Many thanks!

robbycarter commented 5 months ago

> Hi,
>
> I'm encountering the same issue. If I use a "return null" statement to skip a document row, I get the "schema can not be null" error. Did anyone manage to resolve this? Many thanks!

Hi @britz89. I have not found a fix, but I did find an alternative way to skip those documents: I pull all the data into BigQuery first, then run a saved query that creates a new table from the import, with the filter applied in that query.
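Roughly, the saved query just rebuilds the final table from the raw import with the filter applied. Here is a minimal sketch of automating that step with the BigQuery Node.js client; the dataset and table names are placeholders, and it assumes has_parent ends up as a top-level column (e.g. with the FLATTEN user option):

const {BigQuery} = require('@google-cloud/bigquery');

async function filterImportedDeliveries() {
  const bigquery = new BigQuery();

  // Rebuild the final table from the raw import, dropping the
  // documents the UDF was supposed to skip.
  const [job] = await bigquery.createQueryJob({
    query: `CREATE OR REPLACE TABLE \`my_dataset.deliveries\` AS
            SELECT *
            FROM \`my_dataset.deliveries_raw\`
            WHERE has_parent IS NOT TRUE`,
  });
  await job.getQueryResults(); // wait for the query job to finish
}

filterImportedDeliveries().catch(console.error);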

britz89 commented 5 months ago

So if I understood correctly, you are pulling the full collection, storing it in a temp table, and then filtering the rows in a subsequent step. Correct? My requirement is to avoid a full copy of the collection, so I hope this issue gets fixed; otherwise I will have to find another way. Thanks for the suggestion, btw!

robbycarter commented 5 months ago

> So if I understood correctly, you are pulling the full collection, storing it in a temp table, and then filtering the rows in a subsequent step. Correct? My requirement is to avoid a full copy of the collection, so I hope this issue gets fixed; otherwise I will have to find another way. Thanks for the suggestion, btw!

Yes, that is what I am currently doing until it is fixed, because I need a working solution now. The other alternative I thought about is running a custom build of the batch template with the issue fixed.
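For anyone going the custom-template route: the stack trace above points at the withSchema call in MongoDbToBigQuery.java (line 154 in the version I ran), so presumably the fix is to make the launch-time schema inference tolerate a UDF that filters out the sampled document. I have not tried this myself.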