databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0
226 stars · 119 forks

`connect_retries` and `connect_timeout` parameters don't have an effect #778

Open henlue opened 2 months ago

henlue commented 2 months ago

Describe the bug

The `connect_retries` and `connect_timeout` parameters in `profiles.yml` don't have the effect described in the docs.

The retry functionality seems to be implemented, but the list of exceptions that trigger a retry is empty by default (here and here). It is possible to configure the connector to retry on all exceptions by setting `retry_all: true`, which gives `connect_retries` and `connect_timeout` their documented effect, but the `retry_all` parameter is itself undocumented; I only found it while reading the code.
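As a workaround, the undocumented flag can be set next to the documented parameters in `profiles.yml`. A minimal sketch; the exact placement under the output block is an assumption based on reading the code, not on the docs:

```yaml
databricks:
  outputs:
    prod:
      type: databricks
      # documented parameters, only effective once retries are enabled:
      connect_retries: 5
      connect_timeout: 60
      # undocumented flag found in the adapter source:
      retry_all: true
```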

Depending on the desired behavior I see various ways to fix this:

  1. Keep the existing behavior and update the documentation, for example by documenting the `retry_all` parameter.
  2. Change the existing behavior to match the documentation, for example:
     a) add transient exceptions to the `retryable_exceptions` list,
     b) set `retry_all` to true by default (I'm not sure about the side effects, though), or
     c) forward the `connect_retries` and `connect_timeout` parameters to the databricks sql connector, if that is possible.
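This is not the adapter's actual code, but the mechanics behind the bug and behind option 2(a) can be sketched generically: when the list of retryable exceptions is empty, the `except` clause matches nothing, so the first failure propagates immediately and the configured retry count never comes into play.

```python
import time

# Generic sketch (not dbt-databricks source): retry the connection
# only for exception types listed in retryable_exceptions.
def connect_with_retries(connect, retryable_exceptions,
                         connect_retries=1, connect_timeout=0):
    last_exc = None
    for attempt in range(connect_retries + 1):
        try:
            return connect()
        # tuple([]) is the empty tuple, which matches no exception at all,
        # so with the default empty list every error escapes immediately.
        except tuple(retryable_exceptions) as exc:
            last_exc = exc
            if attempt < connect_retries:
                time.sleep(connect_timeout)
    raise last_exc
```

With `retryable_exceptions=[]` any failure escapes on the first attempt, which matches the reported behavior; option 2(a) amounts to pre-populating that list with transient error types.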

I would be willing to implement a fix or to take a deeper look into the implications of the various fixes I've described.

Steps To Reproduce

I've created a `profiles.yml` with invalid connection parameters and a high number of `connect_retries` (a `target` key is required for the profile to be valid):

```yaml
databricks:
  target: test
  outputs:
    test:
      type: databricks
      host: invalid
      http_path: invalid
      token: invalid
      schema: schema
      connect_retries: 1000
```

then executed `dbt run`.

Expected behavior

I expect 1000 retries. Instead, dbt tries to establish the connection for 15 minutes, just as it does when `connect_retries` is set to 1, and then fails.

System information

The output of `dbt --version`:

```
Core:
  - installed: 1.8.4
  - latest:    1.8.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark:      1.8.0 - Up to date!
  - databricks: 1.8.3 - Update available!
```

The operating system you're using: Ubuntu 22.04
The output of `python --version`: Python 3.10.12

Additional context

We use a classic warehouse on Azure for our daily jobs. By default, dbt-databricks tries for 15 minutes to establish a connection to the warehouse, but sometimes that is not enough time for the warehouse to start.

benc-db commented 2 months ago

Thanks for the report. Quite a few users have been asking about retries lately, so I think I'll need to look into it.

Tonayya commented 1 month ago

Hi @benc-db, a quick question on connection retries: which types of connection failures would this functionality actually retry on? For example, if a connection fails due to cluster maintenance, would it retry?

benc-db commented 1 month ago

Connection retries apply to situations where the SQL Gateway returns a 429 or 503, i.e. signals that it has not scheduled the request due to insufficient resources or because it is busy. I believe this covers your cluster-maintenance case, but I'm not 100% certain; you can ask on the sql connector page for more details.
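As a rough illustration only (the status set comes from the comment above, not from the adapter source, and the helper name is hypothetical), the retry decision described here reduces to a status-code check:

```python
# Hypothetical helper: HTTP statuses treated as "busy, try again later",
# per the 429 / 503 behavior described above.
RETRYABLE_STATUSES = frozenset({429, 503})

def should_retry(status_code: int) -> bool:
    """Return True when the SQL Gateway signals it could not schedule the request."""
    return status_code in RETRYABLE_STATUSES
```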