@jweibel22 Thanks for opening!
I see two ways in which dbt could "retry" on transient errors: by retrying the failed query immediately within the same run, or by re-running the models that ended up with status:error or status:skipped in run_results.json (#2465). I think the first is preferable, if possible. Ultimately, both are worth doing.
The mileage will vary significantly by adapter; as you say, the first and biggest challenge is reliably identifying which errors are transient and worth retrying. In fact, dbt-bigquery already has logic for this today, since google.api_core offers a retry helper (docs). If we can identify similar error codes for psycopg2, we could implement this for dbt-postgres + dbt-redshift as well. (I found a thread here that may hold promise.) Or, if Amazon's new python driver offers more reliable + descriptive error codes, it's one more good reason to consider switching.
In any case, I think the execute method, or a method called by it, is probably the right place to implement this logic.
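For illustration, here is a minimal sketch, not actual dbt code, of what a retry around the adapter's execute call could look like for the psycopg2-based adapters; the function name, parameters, and the choice of OperationalError as the "possibly transient" case are assumptions, not an agreed-upon design:

import time
import psycopg2

def execute_with_retry(cursor, sql, max_retries=3, delay_seconds=5):
    # Run the statement, retrying a few times on errors treated as transient.
    # psycopg2.OperationalError covers disconnects and timeouts, which are the
    # kinds of failures discussed in this thread; a real implementation would
    # need a more careful allow-list of error codes.
    for attempt in range(max_retries + 1):
        try:
            cursor.execute(sql)
            return
        except psycopg2.OperationalError:
            if attempt == max_retries:
                raise
            time.sleep(delay_seconds)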
@jtcohen6 what if, instead of trying to identify all the transient errors for all the different connectors, we start by allowing a list of exceptions that you want to retry on to be defined in profiles.yml?
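Roughly, the matching side of that could be as simple as this; retryable_errors and the idea of loading it from profiles.yml are purely hypothetical, not an existing dbt setting:

# Hypothetical: exception class names the user lists under their profile.
retryable_errors = ["OperationalError", "InterfaceError"]

def is_retryable(exc, retryable_errors):
    # Retry only when the raised exception's class name is in the user's list.
    return type(exc).__name__ in retryable_errors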
I'm going to leave this in the current repo, but since the required changes are adapter-specific, it's likely that we'll want to open issues in the adapter repos as well. (In the case of Redshift, too, the change would probably involve connections.py in dbt-postgres, which is still in this repo.)
Would love this functionality as well.
Here's a specific use case.
We are using RDS Aurora Serverless, which has auto-scaling built-in.
When it starts to scale, it kicks existing connections off.
And it starts to scale every time there is load, which is often the case when refreshing lots of dbt models at once.
So it kicks dbt out and the entire run fails because of maybe one or two failed models, as they were disconnected.
Our runs happen every few hours. By then RDS has scaled down already, and when dbt runs next time, it fails again.
In the end, pretty much every run fails in some way. The only time it does not fail is when the cluster was already scaled out due to some other job triggering the scaling event.
To work around that, we'll implement forced scaling before dbt runs. But I wish there were just a way to retry certain errors, like disconnects.
I stumbled across this issue as I was debugging some sporadic timeouts when running dbt on our Redshift cluster. I'm commenting to reinforce the need for this feature with my current situation:
From the Redshift connection logs I can see when dbt attempts to connect, and these attempts are successful. However, after about 10 seconds, dbt reports a timeout error (consistent with the default connection_timeout parameter from the postgres adapter). Moreover, the Redshift connection logs also show that the connections dbt started stay up for longer than 10 seconds, which indicates this could be due to transient network problems.
Having the ability to immediately retry when connecting would be very useful for us. At the moment, we are relying on the status:error workaround/feature, but this requires the overhead of restarting our dbt tasks.
Thanks for everything!
👍 on this issue, we are hitting this with Databricks too
Same problem happened to us (with Databricks).
Same issue here using sources that are secured views provided through a Snowflake datashare. Once every 24 hours, the views are dropped and re-created. I don't know why it doesn't happen in a single transaction but the net result is that the views are unavailable for a few seconds. If a job relying on that source happens to be running at that time, it will fail and return an error.
Either a retry logic or any other way to handle this situation gracefully would be great!
Some sort of retry at the model level would be helpful. I have a model in Amazon Aurora PostgreSQL that calls a SageMaker endpoint for ML inferences. Sometimes there are invocation timeouts from the endpoint being too busy, resulting in a failed dbt model. The endpoint autoscales, so simply re-running the model usually works.
I would really like some sort of option that I can apply to specific models, to retry a model N times before considering it failed.
Hi, how did you all get through this? Apparently, we are also receiving the error below. It is intermittent as well, since it works fine the very next time the same model is triggered.
Hi, how did you all get through this? We are facing a similar issue on Redshift now when random reboots happen. We would love to have a retry option in dbt Cloud jobs.
Database Error in model data_inspection (models/exposure/tableau/product/data_inspection.sql)
16:28:24  SSL SYSCALL error: EOF detected
16:28:24  compiled Code at target/run/analytics/models/exposure/tableau/product/data_inspection.sql
could not connect to server: Connection refused
The dbt retry command has existed since 1.6.0. See #7299 and About dbt retry command.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Our use case is similar to @kinghuang's: we have some specific models that we would like to retry upon failure (any failure). We have UDFs calling external services, so many different transient issues can happen (service issues, network issues, etc.).
The dbt retry command is not ideal as it involves some overhead in order to orchestrate the retries.
Ideally, we'd be looking for something like this:
{{ config(
    retries=2,
    retry_delay=30
) }}
The above would mean: if the model fails, wait 30 seconds and then try again, up to a maximum of two times.
It's a fairly simple feature, and it would benefit everyone.
Describe the feature
This has come up here before. Sometimes transient errors occur, and it would be nice if dbt could automatically retry on those occasions. Right now, those transient errors cause our nightly dbt pipeline to fail, which blocks downstream pipelines.
Specifically, we're dealing with some hard-to-track DNS resolution problems in our network setup, which cause some flakiness.
Describe alternatives you've considered
Our dbt pipeline is run from an Airflow DAG, and the only way to retry is to re-run the DAG, which runs the entire dbt pipeline. We could implement better support for running only specific models in our production environment so we can fix the problem faster; however, this would still require manual work and cause a significant delay, as it wouldn't take place until someone notices the problem in the morning.
Additional context
We're using Redshift, but the problem is not related to a specific database.
Who will this benefit?
Everyone who deals with networking and other operational issues.
Are you interested in contributing this feature?
We don't mind contributing to this. The first problem is identifying which errors are transient (worth retrying) and which are not: https://www.psycopg.org/docs/errors.html
It might be a good idea to leave this decision to the user and let the user configure a retry strategy (but provide a sensible default).
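As one possible shape for that sensible default, here is a hedged sketch that classifies psycopg2 errors by SQLSTATE class using the docs linked above; the chosen classes and the helper name are illustrative guesses, not a vetted list:

import psycopg2

# SQLSTATE classes that are plausibly transient: "08" = connection exceptions,
# "53" = insufficient resources, "57" = operator intervention. Illustrative only.
TRANSIENT_SQLSTATE_CLASSES = {"08", "53", "57"}

def looks_transient(exc):
    # Treat dropped connections as transient; psycopg2 raises OperationalError
    # for them, and .pgcode is often None in that case.
    if isinstance(exc, psycopg2.OperationalError):
        return True
    # Otherwise check the SQLSTATE code exposed on psycopg2 errors via .pgcode.
    code = getattr(exc, "pgcode", None)
    return code is not None and code[:2] in TRANSIENT_SQLSTATE_CLASSES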