dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0
216 stars 153 forks source link

[Bug] Dataproc (Python models) do not retry polling on 5xx errors #1271

Open OSalama opened 3 months ago

OSalama commented 3 months ago

Is this a new bug in dbt-bigquery?

Current Behavior

After dbt-bigquery submits a dataproc batch job, it enters a polling mode, waiting for a response to indicate the job completed (dbt/adapters/bigquery/dataproc/batch.py#29). This polling process is not retrying transient errors, so the dbt run ends up failing with an error, while the actual dataproc job runs to completion successfully.

Expected Behavior

We should be performing the dataproc batch polling with a retry strategy to retry transient errors.

The BatchControllerClient.get_batch() function takes a retry parameter:

retry (google.api_core.retry.Retry): Designation of what errors, if any,should be retried.`

so we should pass one in!

Steps To Reproduce

This relies on GCP API throwing a 5xx error so it's not very reproducible, though we are hitting the error at least once a day in our runs.

  1. Use dbt-bigquery 1.8.1

  2. Configure Dataproc serverless in profiles.yml:

    target: dev
    outputs:
    dev:
      type: bigquery
      location: EU
      job_retries: 5
      dataproc_batch:
        environment_config:
          execution_config:
            service_account: dbt-dev-runner@data-dev.iam.gserviceaccount.com
            subnetwork_uri: projects/data-dev/regions/europe-west1/subnetworks/dataproc-dbt-subnet
  3. Submit many long-running Python models

  4. Wait until one of them fails.

Relevant log output

11:05:53.104591 [info ] [Thread-1  ]: BigQuery adapter: Submitting batch job with id: d9cb4840-a1f1-49f3-bd8b-2af2d8dbcbfd
11:06:40.287936 [error] [Thread-1  ]: Unhandled error while executing target/run/core/models/data_tables/DT_segments_variants.py
503 502:Bad Gateway
11:06:40.298010 [debug] [Thread-1  ]: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 76, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "502:Bad Gateway"
    debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-07-04T11:06:40.285827131+00:00", grpc_status:14, grpc_message:"502:Bad Gateway"}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/dbt/task/base.py", line 368, in safe_run
    result = self.compile_and_execute(manifest, ctx)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/base.py", line 314, in compile_and_execute
    result = self.run(ctx.node, manifest)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/base.py", line 415, in run
    return self.execute(compiled_node, manifest)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/run.py", line 298, in execute
    result = MacroGenerator(
  File "/usr/local/lib/python3.9/site-packages/dbt/clients/jinja.py", line 84, in __call__
    return self.call_macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt_common/clients/jinja.py", line 298, in call_macro
    return macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 768, in __call__
    return self._invoke(arguments, autoescape)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 782, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 89, in macro
  File "/usr/local/lib/python3.9/site-packages/jinja2/sandbox.py", line 394, in call
    return __context.call(__obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 303, in call
    return __obj(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt/clients/jinja.py", line 84, in __call__
    return self.call_macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt_common/clients/jinja.py", line 298, in call_macro
    return macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 768, in __call__
    return self._invoke(arguments, autoescape)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 782, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 55, in macro
  File "/usr/local/lib/python3.9/site-packages/jinja2/sandbox.py", line 394, in call
    return __context.call(__obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 303, in call
    return __obj(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt/context/providers.py", line 1444, in submit_python_job
    return self.adapter.submit_python_job(parsed_model, compiled_code)
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/base/impl.py", line 175, in execution_with_log
    response = code_execution_function(*args)
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/base/impl.py", line 1536, in submit_python_job
    submission_result = job_helper.submit(compiled_code)
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/python_submissions.py", line 75, in submit
    return self._submit_dataproc_job()
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/python_submissions.py", line 144, in _submit_dataproc_job
    return poll_batch_job(
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/dataproc/batch.py", line 38, in poll_batch_job
    response = job_client.get_batch(  # type: ignore
  File "/usr/local/lib/python3.9/site-packages/google/cloud/dataproc_v1/services/batch_controller/client.py", line 950, in get_batch
    response = rpc(
  File "/usr/local/lib/python3.9/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.ServiceUnavailable: 503 502:Bad Gateway

Environment

- OS: docker pull python:3.9.8-slim-buster (Debian)
- Python: Python 3.9.8
- dbt-core: 1.8.3
- dbt-bigquery: 1.8.1

Additional Context

Feels similar to https://github.com/dbt-labs/dbt-bigquery/issues/682, but I believe the Dataproc element means we're dealing with a separate code path that doesn't benefit from RETRYABLE_ERRORS. The stacktrace also doesn't mention dbt/adapters/bigquery/connections.py.

OSalama commented 2 months ago

I've been running my proposed fix (#1275) in production for 2 weeks and have not seen any further occurrences of the issue.

jmesterh commented 4 weeks ago

Just ran across this today as well (Python model / GCP / Dataproc)

Unhandled error while executing /tmp/tmppi1e0ffu/run/common/models/modelname.py
503 Socket closed