dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Retry on transient errors #3303

Closed jweibel22 closed 7 months ago

jweibel22 commented 3 years ago

Describe the feature

This has come up here before. Sometimes transient errors occur, and it would be nice if dbt could automatically retry on those occasions. Right now, those transient errors cause our nightly dbt pipeline to fail, which blocks downstream pipelines.

Specifically, we're dealing with some hard-to-track DNS resolution problems in our network setup, which cause some flakiness.

Describe alternatives you've considered

Our dbt pipeline is run from an Airflow DAG, and the only way to retry is to re-run the DAG, which runs the entire dbt pipeline. We could implement better support for running only specific models in our production environment so we could fix the problem faster, but this would still require manual work and cause a significant delay, since it wouldn't happen until someone notices the problem in the morning.

Additional context

We're using Redshift, but the problem is not specific to any one database.

Who will this benefit?

Everyone who deals with networking and other operational issues.

Are you interested in contributing this feature?

We don't mind contributing to this. The first problem is identifying which errors are transient (worth retrying) and which are not: https://www.psycopg.org/docs/errors.html. It might be a good idea to leave this decision to the user and let them configure a retry strategy (while providing a sensible default).
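As a rough sketch (not existing dbt code; the helper name and the specific SQLSTATE list are assumptions on my part), classifying transient psycopg2 errors could look something like this:

import psycopg2

# SQLSTATE classes/codes that usually indicate a transient condition:
#   08xxx  connection exceptions       53xxx  insufficient resources
#   40001  serialization failure       40P01  deadlock detected
#   57P03  cannot connect now
TRANSIENT_PREFIXES = ("08", "53")
TRANSIENT_CODES = {"40001", "40P01", "57P03"}

def is_transient(exc):
    # Network-level failures (DNS lookups, dropped connections) often surface
    # as OperationalError with no SQLSTATE attached at all.
    if isinstance(exc, psycopg2.OperationalError) and exc.pgcode is None:
        return True
    code = getattr(exc, "pgcode", None)
    return code is not None and (
        code in TRANSIENT_CODES or code.startswith(TRANSIENT_PREFIXES)
    )

A user-facing retry config could then default to a classifier like this one, with the option to override it.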

jtcohen6 commented 3 years ago

@jweibel22 Thanks for opening!

I see two ways in which dbt could "retry" on transient errors:

  1. Query-level: If a query returns a transient error from the database, dbt's connection logic should handle that error and retry the query.
  2. Invocation-level: If a single dbt model failed, rather than re-running the entire DAG, it should be possible to select and re-run just the models with status:error or status:skipped in run_results.json (#2465).

I think the first is preferable, if possible. Ultimately, both are worth doing.
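For the second, invocation-level option, the kind of selection syntax proposed there could look roughly like this (illustrative only; see #2465 for the actual proposal, and treat the paths as placeholders):

dbt run                                                 # initial run; some models fail
dbt run --select result:error+ --state ./prev_target    # re-run the failures and their children

This relies on keeping the run_results.json from the previous invocation around (here in a hypothetical ./prev_target directory) so the result: selector has something to compare against.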

The mileage will vary significantly by adapter; as you say, the first and biggest challenge is reliably identifying which errors are transient and worth retrying. In fact, dbt-bigquery already has logic for this today, since google.api_core offers a retry helper (docs).

https://github.com/fishtown-analytics/dbt/blob/8d39ef16b67fcc3c30e8a18751b4e64831ec9aaf/plugins/bigquery/dbt/adapters/bigquery/connections.py#L44-L49

https://github.com/fishtown-analytics/dbt/blob/8d39ef16b67fcc3c30e8a18751b4e64831ec9aaf/plugins/bigquery/dbt/adapters/bigquery/connections.py#L337

If we can identify similar error codes for psycopg2, we could implement this for dbt-postgres + dbt-redshift as well. (I found a thread here that may hold promise.) Or, if Amazon's new python driver offers more reliable + descriptive error codes, it's one more good reason to consider switching.

In any case, I think the execute method, or a method called by it, is probably the right place to implement this logic.
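As a sketch only (not existing adapter code; the wrapper name, exception choice, and retry parameters are assumptions), the query-level retry might look something like this for psycopg2-based adapters:

import time
import psycopg2

def execute_with_retry(cursor, sql, max_retries=3, delay=5):
    # Retry on connection-level errors with a fixed pause between attempts.
    # A real implementation would also reconnect if the session was dropped
    # and inspect SQLSTATE codes to decide whether the error is truly transient.
    attempt = 0
    while True:
        try:
            cursor.execute(sql)
            return
        except (psycopg2.OperationalError, psycopg2.InterfaceError):
            attempt += 1
            if attempt > max_retries:
                raise
            time.sleep(delay)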

alecrubin commented 3 years ago

@jtcohen6 what if, instead of trying to identify all the transient errors for all the different connectors, we start by allowing users to define a list of exceptions to retry on in profiles.yml?
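Something like this, purely as a hypothetical sketch — dbt doesn't read a retry_on key like this today, and the exception names are only examples:

my_project:
  target: prod
  outputs:
    prod:
      type: redshift
      host: my-cluster.example.com   # placeholder
      retries: 3
      retry_on:
        - psycopg2.OperationalError
        - psycopg2.errors.SerializationFailure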

jtcohen6 commented 3 years ago

I'm going to leave this in the current repo, but since the required changes are adapter-specific, it's likely that we'll want to open issues in the adapter repos as well.

(In the case of Redshift, too, the change would probably involve connections.py in dbt-postgres, which is still in this repo)

moltar commented 3 years ago

Would love this functionality as well.

Here's a specific use case.

We are using RDS Aurora Serverless, which has auto-scaling built-in.

When it starts to scale, it drops existing connections.

And it starts to scale every time there is load, which is often the case when refreshing lots of dbt models at once.

And then it kicks dbt out and the entire run fails because of maybe one or two failed models, as they were disconnected.

And our runs happen every few hours. By then RDS had scaled down already. And when dbt runs next time, it fails again.

In the end, pretty much every run fails in some way. The only time it does not fail is when the cluster has already scaled out because some other job triggered the scaling event.

To work around that we'll implement forced scaling before dbt runs. But I wish there was just a way to retry certain errors, like disconnects.

tomasfarias commented 2 years ago

I stumbled across this issue while debugging some sporadic timeouts when running dbt on our Redshift cluster. I'm commenting to reinforce the need for this feature with my current situation:

From the Redshift connection logs I can see when dbt attempts to connect, and these attempts are successful. However, after about 10 seconds, dbt reports a timeout error (consistent with the default connection_timeout parameter from the postgres adapter). Moreover, the Redshift connection logs also show that the connections dbt started stay up for longer than 10 seconds, which indicates this could be due to transient network problems.

Having the ability to immediately retry the connection would be very useful for us. At the moment, we are relying on the status:error workaround/feature, but this requires the overhead of restarting our dbt tasks.
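In the meantime, raising the connection timeout in profiles.yml papers over the slow connection attempts; a minimal sketch, assuming the connect_timeout key exposed by current dbt-postgres/dbt-redshift setups (values illustrative):

my_project:
  outputs:
    prod:
      type: redshift
      connect_timeout: 30   # seconds; the default is 10
      # remaining connection settings unchanged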

Thanks for everything!

francescomucio commented 2 years ago

👍 on this issue, we are hitting this with Databricks too

xg1990 commented 2 years ago

Same problem happened to us (with Databricks).

raphaelvarieras commented 1 year ago

Same issue here using sources that are secured views provided through a Snowflake datashare. Once every 24 hours, the views are dropped and re-created. I don't know why it doesn't happen in a single transaction but the net result is that the views are unavailable for a few seconds. If a job relying on that source happens to be running at that time, it will fail and return an error.

Either a retry logic or any other way to handle this situation gracefully would be great!

kinghuang commented 1 year ago

Some sort of retry at the model level would be helpful. I have a model in Amazon Aurora PostgreSQL that calls a SageMaker endpoint for ML inferences. Sometimes there are invocation timeouts from the endpoint being too busy, resulting in a failed dbt model. The endpoint autoscales, so simply re-running the model usually works.

I would really like some sort of option that I can apply to specific models, to retry a model N times before considering it failed.

dharitsura248 commented 1 year ago

Hi, how did you all get through this? We are also receiving a similar error, and it's intermittent as well: the same model works fine the next time it is triggered.

yuna-tang commented 1 year ago

Hi, how did you all get through this? We are facing a similar issue on Redshift now when random reboots happen. We would love to have a retry option in dbt Cloud jobs.

Database Error in model data_inspection (models/exposure/tableau/product/data_inspection.sql)
16:28:24  SSL SYSCALL error: EOF detected
16:28:24  compiled Code at target/run/analytics/models/exposure/tableau/product/data_inspection.sql
could not connect to server: Connection refused

kinghuang commented 1 year ago

The dbt retry command has been available since 1.6.0. See #7299 and the About dbt retry command docs.
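Usage is minimal; per the docs, dbt retry picks up where the previous invocation left off, based on its run_results.json:

dbt run      # some models error out partway through
dbt retry    # re-runs only the nodes that errored or never ran, using
             # run_results.json from the previous invocation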

github-actions[bot] commented 8 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 7 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

mroy-seedbox commented 7 months ago

Our use case is similar to @kinghuang's: we have some specific models that we would like to retry upon failure (any failure). We have UDFs calling external services, so many different transient issues can happen (service issues, network issues, etc.).

The dbt retry command is not ideal as it involves some overhead in order to orchestrate the retries.

Ideally, we'd be looking for something like this:

{{ config(
  retries=2,
  retry_delay=30
) }}

The above would mean: if the model fails, wait 30 seconds and then try again, up to a maximum of two times.

It's a fairly simple feature, and it would benefit everyone.