Open: @jars opened this issue 2 years ago
We have been experiencing the same issue. It also affects dbt cloud. Thank you @jars for doing the work to reproduce this issue!
Thanks @jars!
There's a specific bug here: dbt should respect the `TimeoutError` when deciding whether to retry a job. If it times out, it should retry. That should be as simple as adding the `TimeoutError` class to the set of `RETRYABLE_ERRORS` here:
At least, I think so — but then after thinking a bit about it, I'm less sure! (If it does time out, and the query keeps going, and actually succeeds — but then dbt retries — don't we risk running the same query / performing the same operation twice? Can we lean on dbt's idempotent materializations here? Not to mention the expense...)
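For concreteness, the proposed one-line change might look like this sketch. The tuple contents and helper below are illustrative only; the actual `RETRYABLE_ERRORS` definition lives in dbt-bigquery's connections module and includes Google client exceptions as well:

```python
# Illustrative sketch: not the exact contents of dbt-bigquery's connections.py.
RETRYABLE_ERRORS = (
    ConnectionResetError,
    ConnectionError,
    TimeoutError,  # proposed addition: retry when a job exceeds job_execution_timeout_seconds
)


def is_retryable(error: BaseException) -> bool:
    # dbt-bigquery's retry wrapper checks a raised error against this tuple
    # to decide whether the job should be attempted again.
    return isinstance(error, RETRYABLE_ERRORS)
```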
Another option we could try here: if `job_execution_timeout_seconds` is reached, dbt polls the BigQuery API, checking on the job ID to see if it is still running, even though it has exceeded the user-configured `job_execution_timeout_seconds`. But then what? Should dbt actually cancel the job? Just keep waiting, up to `job_retry_deadline_seconds` (the total amount of time dbt should spend trying to run this model, inclusive of all retries)?
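To make the "keep waiting up to the retry deadline" option concrete, here is a hypothetical helper; the function name and its injected callbacks are assumptions, not part of dbt-bigquery. A real implementation would fetch the job's state via the BigQuery client rather than a `get_state` callable:

```python
import time


def wait_past_timeout(get_state, retry_deadline_s, poll_interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Hypothetical sketch: after job_execution_timeout_seconds fires, keep
    polling the job's state (e.g. via the BigQuery API) until it finishes or
    job_retry_deadline_seconds is exhausted."""
    deadline = clock() + retry_deadline_s
    while clock() < deadline:
        if get_state() == "DONE":  # BigQuery jobs report PENDING/RUNNING/DONE
            return "DONE"
        sleep(poll_interval_s)
    return "TIMED_OUT"  # caller decides what happens next: cancel, or retry
```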
I'd prefer to avoid implementing more logic here than we need.
As a larger point, we are looking to replace a lot of our current timeout/retry code in dbt-bigquery (legacy + organically developed) with the higher-level retry logic that Google's client libraries make available. As a first example: #230, #231. If users provide granular configuration around timeout + retry, we should still do our best to respect it (by passing it into Google's decorator), but we should be moving in the direction of handling much, much less of this logic ourselves.
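For a sense of the shape of that higher-level approach, here is a simplified, stdlib-only stand-in for the pattern that `google.api_core.retry.Retry` implements; this is an illustration of the idea, not the real decorator:

```python
import functools
import time


def retrying(predicate, initial=1.0, multiplier=2.0, maximum=8.0,
             deadline=60.0, sleep=time.sleep, clock=time.monotonic):
    """Simplified stand-in for google.api_core.retry.Retry: re-invoke a call
    whose exception satisfies `predicate`, with capped exponential backoff,
    until `deadline` seconds have elapsed."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay, stop_at = initial, clock() + deadline
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if not predicate(exc) or clock() >= stop_at:
                        raise
                    sleep(min(delay, maximum))
                    delay *= multiplier
        return wrapper
    return decorator
```

With the real library, user-provided settings could map onto something like `Retry(predicate=if_exception_type(...), deadline=job_retry_deadline_seconds)`, so dbt would own the configuration mapping rather than the retry loop itself.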
Thanks @jtcohen6, you've given us some ideas to chew on for the next couple of days.
Do you have insight into the priority of #231, and whether it would address this specific issue? Or do you think #231 would swap out `_retry_and_handle` for the `Retry` decorator without behaviour changes, followed by a separate PR to deal with this timeout issue?
@jtcohen6, I totally agree! I think the solution proposed for https://github.com/dbt-labs/dbt-bigquery/issues/231 could address this issue, and it's the ideal solution.
Do you know if anyone is working on https://github.com/dbt-labs/dbt-bigquery/issues/231, and what its priority is? This is an important issue for us, and I am happy to contribute to the implementation.
Hi all! I wanted to check in on this issue: is a fix in the works? We're running into this problem as well, and I'm glad to see there's a thoughtful discussion about it!
@hui-zheng sorry for the delayed answer.
If you think you can contribute, please do so on #231, even if it's only to scope the work or put together a proof of concept. We are slowly catching up on the backlog as the team ramps up, but any help is appreciated ;)
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
@Fleid how do we get this issue back onto the team's backlog? Can we reopen this one?
Flagging that we at Superhuman are continuing to encounter this issue, as @lillikulak mentions above. Any update on progress toward solving this problem?
Describe the bug
The `job_retries` BigQuery job configuration setting is not used in conjunction with the `job_execution_timeout_seconds` setting. Instead of retrying a job that times out, dbt immediately fails with `Operation did not complete within the designated timeout.`
Steps To Reproduce
Sure. First, use the official dbt Docker image and run `dbt init` to initialise a project.

Make a `profiles.yml` file similar to this. Take special note of the `job_execution_timeout_seconds` value, which I've exaggerated to 1 second to force a query timeout to occur. Also note that `job_retries=5`.
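The original profile didn't survive extraction; a minimal `profiles.yml` sketch along these lines should reproduce the setup, assuming the profile name, dataset, and keyfile path as placeholders:

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project          # placeholder: fill in your project
      dataset: dbt_example
      keyfile: /root/keyfile.json        # mapped into the container
      threads: 1
      job_execution_timeout_seconds: 1   # exaggerated to force a timeout
      job_retries: 5
```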
Fill in the `project` key.

Then, run the built Docker image, executing a dbt run on the example `my_first_dbt_model` model. Map your `profiles.yml` file and `keyfile.json` into the container:

Expected behavior
The model `my_first_dbt_model` would fail, and retry five times.

Screenshots and log output
After dbt errors/exits, the job continues running on BigQuery, and eventually succeeds:
Additional Context
The retries are obeyed in the event of other BigQuery failures. For example, if you introduce an intentional syntax error in your model, it will retry up to `job_retries`.

System information
The output of `dbt --version`:

The operating system you're using: the official Docker image...

The output of `python --version`: Python 3.10.3