dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-1015] job_retries not obeyed when job_execution_timeout_seconds is exceeded #263

Open jars opened 2 years ago

jars commented 2 years ago

Describe the bug

The job_retries BigQuery job configuration setting is not used in conjunction with job_execution_timeout_seconds setting. Instead of retrying a job that times out, dbt immediately fails with Operation did not complete within the designated timeout.

Steps To Reproduce

First, use the official dbt Docker image and run dbt init to initialise a project:

docker build -t dbt-bigquery-issue-263 --no-cache -<<'EOF'
FROM ghcr.io/dbt-labs/dbt-bigquery:1.1.0
WORKDIR /tmp
RUN dbt init --skip-profile-setup issue_263
EOF

Make a profiles.yml file similar to this. Take special note of the job_execution_timeout_seconds value, which I've exaggerated to 1 second to force a query timeout to occur. Also note that job_retries is set to 5. Fill in the project value.

issue_263:
  target: default
  outputs:
    default:
      type: bigquery
      method: service-account
      project: '<your project>'
      dataset: 'issue_263'
      threads: 1
      keyfile: /root/keyfile.json
      job_execution_timeout_seconds: 1
      job_retry_deadline_seconds: 200
      job_creation_timeout_seconds: 30
      job_retries: 5
      priority: interactive
      location: US

Then run the built Docker image, executing dbt run on the example my_first_dbt_model model. Map your profiles.yml file and keyfile.json into the container:

docker run  \
-v `pwd`/profiles.yml:/root/.dbt/profiles.yml \
-v `pwd`/keyfile.json:/root/keyfile.json dbt-bigquery-issue-263 run \
--project-dir /tmp/issue_263 -m my_first_dbt_model

Expected behavior

The model my_first_dbt_model should fail, then be retried up to five times.

Screenshots and log output

After dbt errors/exits, the job continues running on BigQuery, and eventually succeeds:

[Screenshot: the BigQuery console shows the job still running after dbt exits; it eventually succeeds]

Additional Context

The retries are obeyed in the event of other BigQuery failures. For example, if you introduce an intentional syntax error in your model, it will retry up to job_retries.

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.2.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 1.1.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: Using the official Docker Image...

root@7fbf851eb622:/# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:    11
Codename:   bullseye

The output of python --version: Python 3.10.3

codigo-ergo-sum commented 2 years ago

We have been experiencing the same issue. It also affects dbt cloud. Thank you @jars for doing the work to reproduce this issue!

jtcohen6 commented 2 years ago

Thanks @jars!

There's a specific bug here: dbt should respect the TimeoutError when deciding whether to retry a job. If a job times out, it should be retried. That should be as simple as adding the TimeoutError class to the set of RETRYABLE_ERRORS here:

https://github.com/dbt-labs/dbt-bigquery/blob/9c36aa239fb674faebfadab0a2b6a38cd201da86/dbt/adapters/bigquery/connections.py#L51-L56

At least, I think so — but then after thinking a bit about it, I'm less sure! (If it does time out, and the query keeps going, and actually succeeds — but then dbt retries — don't we risk running the same query / performing the same operation twice? Can we lean on dbt's idempotent materializations here? Not to mention the expense...)

Another option we could take here: if job_execution_timeout_seconds is reached, dbt polls the BigQuery API, checking on the job ID to see whether it is still in progress, even though it has exceeded the user-configured job_execution_timeout_seconds. But then what? Should dbt actually cancel the job? Or just keep waiting, up to job_retry_deadline_seconds (the total amount of time dbt should spend trying to run this model, inclusive of all retries)?
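That polling loop could be sketched like this. This is a hypothetical helper, not dbt code; the get_state callable stands in for something like a google-cloud-bigquery lookup (e.g. client.get_job(job_id).state), injected here so the logic is testable:

```python
import time


def wait_for_job(get_state, deadline_seconds, poll_interval=5.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll a job's state until it finishes or the deadline elapses.

    get_state: zero-argument callable returning the BigQuery job state
    string ("PENDING", "RUNNING", or "DONE"); with the real client this
    could be `lambda: client.get_job(job_id, location=location).state`.
    Returns True if the job reached DONE, False if the deadline elapsed.
    """
    start = clock()
    while True:
        if get_state() == "DONE":
            return True
        if clock() - start >= deadline_seconds:
            return False
        sleep(poll_interval)
```

Injecting clock and sleep also makes it easy to reason about the deadline without real waiting.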

I'd prefer to avoid implementing more logic here than we need.

As a larger point, we are looking to replace a lot of our current timeout/retry code in dbt-bigquery (legacy + organically developed) with the higher-level retry logic that Google's client libraries make available. As a first example: #230, #231. If users provide granular configuration around timeout + retry, we should still do our best to respect it (by passing it into Google's decorator) — but we should be moving in the direction of handling much, much less of this logic ourselves.
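For context, the shape of that higher-level retry logic can be sketched in pure Python. This is a simplified, illustrative stand-in for what google.api_core.retry.Retry provides (a predicate over exceptions plus timing parameters), not the library's actual implementation:

```python
import time


def retrying(predicate, retries, base_delay=1.0, sleep=time.sleep):
    """Simplified stand-in for google.api_core.retry.Retry: re-invoke the
    wrapped callable while predicate(exc) is true, allowing up to `retries`
    extra attempts, with exponential backoff between attempts."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt, or on non-retryable errors.
                    if attempt == retries or not predicate(exc):
                        raise
                    sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```

Passing the user's job_retries and timeout settings into a predicate like this is the kind of "respect granular configuration" wiring described above.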

jars commented 2 years ago

Thanks @jtcohen6, you've given us some ideas to chew on for the next couple of days.

Do you have insight into the priority of #231, and whether it would address this specific issue? Or do you think #231 would swap out _retry_and_handle for the Retry decorator without behaviour changes, followed by a separate PR to deal with this timeout issue?

hui-zheng commented 2 years ago

As a larger point, we are looking to replace a lot of our current timeout/retry code in dbt-bigquery (legacy + organically developed) with the higher-level retry logic that Google's client libraries make available. As a first example: https://github.com/dbt-labs/dbt-bigquery/pull/230, https://github.com/dbt-labs/dbt-bigquery/issues/231. If users provide granular configuration around timeout + retry, we should still do our best to respect it (by passing it into Google's decorator) — but we should be moving in the direction of handling much, much less of this logic ourselves.

@jtcohen6, I totally agree! I think the solution for https://github.com/dbt-labs/dbt-bigquery/issues/231 could address this issue, and it's the ideal solution.

Do you know if anyone is working on https://github.com/dbt-labs/dbt-bigquery/issues/231, and its priority?

This is an important issue for us. I am happy to contribute to the implementation.

lillikulak commented 2 years ago

Hi all! Wanted to check in on this issue, is a fix in the works? We're running into this problem as well -- so glad to see there's a thoughtful discussion about it!

Fleid commented 1 year ago

@hui-zheng sorry for the delayed answer.

If you think you can contribute, please do so on #231, even if it's to scope work or a proof of concept. We are slowly catching up on the backlog, as the team is ramping up, but any help is appreciated ;)

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

lillikulak commented 1 year ago

@Fleid how do we get this issue back onto the team's backlog? Can we reopen this one?

cbyington commented 5 months ago

Flagging that we at Superhuman are continuing to encounter this issue as @lillikulak mentions above - any update on progress toward solving this problem?