databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Databricks workflow with SQL Warehouse Pro crashes with http error 503 with version 1.7.4 #570

Closed OlofL closed 8 months ago

OlofL commented 8 months ago

Describe the bug

A workflow dbt job terminates with the error Max retries exceeded with url: ... (Caused by ResponseError('too many 503 error responses')) just as it starts sending SQL commands to the cluster. This occurs only with version 1.7.4 of the adapter, not with version 1.7.3, and only with SQL Warehouse Pro, not with SQL Warehouse Serverless.

Steps To Reproduce

The problem occurs in a workflow in a Databricks workspace with the following settings:
Running on Azure, Databricks Premium, not Unity Catalog
Job cluster: single node Standard_DS3_v2
Work cluster: SQL Warehouse Pro X-Small, Cluster count: Active 0 Min 1 Max 1, Channel Current, Cost optimized
Git source: Azure DevOps
Library version setting: dbt-databricks>=1.0.0,<2.0.0

Start the workflow. After the job cluster has been created and the SQL Warehouse has been started, an error is shown in the log:

Changing to a SQL Warehouse of type Serverless, X-Small, Cluster count: Active 0 Min 1 Max 1, Current solves the problem. Running an older version of the dbt-databricks library also solves the problem (dbt-databricks>=1.0.0,<1.7.3). The problem appeared on Jan 25 when the version was updated from 1.7.3 to 1.7.4. I have not been able to reproduce the problem when starting dbt interactively, only as a workflow, but that may be my mistake; this has been a stressful couple of days because of this.

Expected behavior

Log output:
02:08:27 Running with dbt=1.7.5
02:08:28 Registered adapter: databricks=1.7.3
02:08:28 Unable to do partial parsing because saved manifest not found. Starting full parse.
02:08:46 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 919 macros, 0 groups, 0 semantic models
02:08:46
02:14:30 Concurrency: 12 threads (target='prod')
02:14:30
02:14:31 1 of 1835 START sql table model staging.rollup12helper ......................... [RUN]
02:14:31 2 of 1835 START sql table model staging.rollup24helper ......................... [RUN]

Screenshots and log output

N/A

System information

dbt 1.7.5
dbt-databricks 1.7.4

Configured in a dbt workflow, not much control over settings there.

Additional context

N/A

lennartkats-db commented 8 months ago

Thanks for the report, @OlofL. Could you elaborate a bit on what kind of workload you're running? Is your model generating a regular table, streaming table, materialized view, or something else?

OlofL commented 8 months ago

Of course. It is our standard daily build: it starts by creating table models and seeds, then 88 different snapshots, and then continues with lots of incremental models, a few table models, and finally a number of views.

We have two workspaces, one using Serverless and one using Pro SQL Warehouses. The problem only occurred in the Pro one. By switching the other workspace from Serverless to Pro, the problem occurred there as well.

OlofL commented 8 months ago

No streaming tables, just a plain batch transformation.

OlofL commented 8 months ago

job.txt Attached is the job definition yml with the sensitive data removed. Changing the dbt-databricks library to less than 1.7.3 is the workaround we have found to work.
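
For reference, a minimal sketch of how that version cap might appear in the dbt task's libraries, written as a Jobs API style payload in Python; only the requirement string reflects the workaround described here, the surrounding structure and names are illustrative assumptions:

# Hedged sketch: pin the adapter below 1.7.3 in the dbt task's libraries.
# Only the requirement string comes from this thread; the payload shape is
# an illustrative assumption based on the Jobs API libraries field.
libraries = [
    {"pypi": {"package": "dbt-databricks>=1.0.0,<1.7.3"}},
]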

benc-db commented 8 months ago

The combination of things working when switching to Serverless or 1.7.3, together with the failures giving up after less than 3 minutes, suggests to me that something has changed in the retry policy in databricks-sql-connector 3.0.1 (which we upgraded to with 1.7.4). Going to work with @susodapop to try to figure out what is different.
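
For anyone curious what that failure mode looks like in isolation, here is a minimal sketch, not the adapter's actual code, of how a requests/urllib3 retry policy that counts 503 responses can give up while a stopped warehouse is still starting; the retry values are illustrative assumptions and the host/path placeholders are the redacted ones from the error message in this thread:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A small retry budget that treats 503 as retryable: once it is exhausted,
# urllib3 gives up with ResponseError('too many 503 error responses'), the
# same message surfaced in the report above.
retry = Retry(total=4, status_forcelist=[503], backoff_factor=2, allowed_methods=None)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

# Against a warehouse that is still cold-starting, every response is a 503,
# so the call fails after a few attempts instead of waiting for the
# warehouse to come up.
session.post("https://adb-xxxxxxx.x.azuredatabricks.net/sql/1.0/warehouses/xxxxxxxx")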

OlofL commented 8 months ago

Great! Since we are able to recreate the problem in both our environments I'm happy to help with any further testing if possible.

benc-db commented 8 months ago

@OlofL if you can get your dbt artifacts to me from one of the failed runs (either file a ticket with your Databricks contact, or you can email it to me at ben.cassell@databricks.com), that would be very helpful.

OlofL commented 8 months ago

I can get it from our second environment, so it is in your mailbox now.

benc-db commented 8 months ago

@OlofL do you see this issue at all with 1.7.3? I'm able to repro the failure this morning with 1.7.3 :/.

OlofL commented 8 months ago

Great that you can replicate it. We have only seen it with 1.7.4, not with 1.7.3. But to be on the safe side we have the setting at <1.7.3, so we are not running version 1.7.3 in our daily build; we are running an earlier version. I can try again tomorrow with version 1.7.3 if that helps to confirm it is the same problem.

andrewpdillon commented 8 months ago

We are also experiencing the same issue with 1.7.4, but not with 1.7.3. 1.7.4 works OK if the SQL Warehouse is already running - but if the workflow has to start the SQL Warehouse that is when we encounter the 503 error.

Adding retries to the task is another workaround for us at this point.
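
For anyone else hitting this, a sketch of what that workaround looks like at the Jobs API level: task-level retry settings, so a run that hits the 503 burst is simply retried after the warehouse has finished starting. The field names are standard Jobs API task fields; the task key and values are illustrative assumptions, not taken from any job definition in this thread:

# Hedged sketch: task-level retries as a stopgap for the 503 failures.
dbt_task = {
    "task_key": "dbt_daily_build",        # hypothetical task name
    "max_retries": 2,                     # re-run the task if it fails
    "min_retry_interval_millis": 300000,  # wait 5 minutes between attempts
    "retry_on_timeout": False,
}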

henlue commented 8 months ago

Even after upgrading to 1.7.5, I'm still facing the same bug as described in this issue. I get the exact same error message as above. 1.7.3 is working fine for me.

Does 1.7.5 resolve the issue for you, @OlofL @andrewpdillon?

My full package list:

Package                   Version
------------------------- ----------
agate                     1.7.1
annotated-types           0.6.0 
appdirs                   1.4.4                                      
argcomplete               3.1.6 
attrs                     23.2.0
Babel                     2.14.0
cachetools                5.3.2                                      
certifi                   2023.11.17
cffi                      1.16.0
cfgv                      3.4.0
chardet                   5.2.0 
charset-normalizer        3.3.2 
click                     8.1.7 
colorama                  0.4.6
commitizen                3.12.0
cryptography              42.0.2
databricks-sdk            0.17.0
databricks-sql-connector  3.0.2
dbt-core                  1.7.6
dbt-databricks            1.7.5 
dbt-extractor             0.5.1 
dbt-semantic-interfaces   0.4.3 
dbt-spark                 1.7.1 
decli                     0.6.1
diff_cover                8.0.3 
distlib                   0.3.8  
et-xmlfile                1.1.0  
filelock                  3.13.1
google-auth               2.27.0
identify                  2.5.33
idna                      3.6   
importlib-metadata        6.11.0
iniconfig                 2.0.0
isodate                   0.6.1
jaraco.classes            3.3.0
jeepney                   0.8.0
Jinja2                    3.1.3
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
keyring                   24.3.0
leather                   0.3.4
Logbook                   1.5.3
lz4                       4.3.3
MarkupSafe                2.1.4
mashumaro                 3.12
minimal-snowplow-tracker  0.0.2
more-itertools            10.2.0
msgpack                   1.0.7
networkx                  3.2.1
nodeenv                   1.8.0
numpy                     1.26.3
oauthlib                  3.2.2
openpyxl                  3.1.2
packaging                 23.2
pandas                    2.1.4
parsedatetime             2.6
pathspec                  0.11.2
pip                       23.3.2
platformdirs              4.2.0
pluggy                    1.4.0
pre-commit                2.18.1
prompt-toolkit            3.0.36
protobuf                  4.25.2
pyarrow                   14.0.2
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pycparser                 2.21
pydantic                  2.6.0
pydantic_core             2.16.1
Pygments                  2.17.2
pytest                    8.0.0
python-dateutil           2.8.2
python-slugify            8.0.2
pytimeparse               1.1.8
pytz                      2023.4
PyYAML                    6.0.1
questionary               2.0.1
referencing               0.33.0
regex                     2023.12.25
requests                  2.31.0
rpds-py                   0.17.1
rsa                       4.9
SecretStorage             3.3.3
setuptools                69.0.3
six                       1.16.0
sqlfluff                  2.3.5
sqlparams                 6.0.1
sqlparse                  0.4.4
tblib                     3.0.0
termcolor                 2.4.0
text-unidecode            1.3
thrift                    0.16.0
toml                      0.10.2
tomlkit                   0.12.3
tqdm                      4.66.1
typing_extensions         4.9.0
tzdata                    2023.4
urllib3                   1.26.18
virtualenv                20.25.0
wcwidth                   0.2.13
wheel                     0.42.0
zipp                      3.17.0
benc-db commented 8 months ago

Hi all, we had to pull 1.7.5 as well because it didn't completely fix the problem. If you'd like to help validate our latest attempt, can you try installing from the 1.7.4testing branch?
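
A minimal sketch of one way to pull that branch in for testing; the branch name comes from this comment, and the pip VCS-URL install is a generic pip feature rather than anything specific to this repo:

# Hedged sketch: install dbt-databricks from the 1.7.4testing branch so a
# workflow or local run can validate the fix. Assumes pip and git access to
# github.com from the environment.
import subprocess
import sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "git+https://github.com/databricks/dbt-databricks.git@1.7.4testing",
])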


benc-db commented 8 months ago

@OlofL @andrewpdillon @henlue please let me know if the above (1.7.4testing branch) resolves the issue.

benc-db commented 8 months ago

@OlofL @andrewpdillon @henlue any update? I'm looking for feedback that this branch resolves the issue before I release it, as this circumstance is very hard to repro internally.

OlofL commented 8 months ago

I configured a new workflow with 1.7.4 to first verify that I still had the error but I didn't get it this time. Is the error dependent on a specific version of dbt? I will try again.

benc-db commented 8 months ago

No, whether you see the error or not depends on how quickly compute is available, which is why reproducing is so hard; it is a function of the cloud provider's compute availability.

OlofL commented 8 months ago

Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.

benc-db commented 8 months ago

> Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.

Yeah, for me, I have the best luck reproducing at 4:30 pm PST, and I have no idea why lol.

OlofL commented 8 months ago

That is when we in Europe start all our major workloads, at 01.30 CET.

lucascajal commented 8 months ago

I have the same issue but with an older version of dbt: 1.4.9. In my case, I had a scheduled job that was working fine up to the 28th of November last year. Then it suddenly started failing the next day, without any changes on our side (development was on hold).

Here's the log for the 28th execution:

01:00:32  Running with dbt=1.4.9
01:00:32  Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:07:03
01:07:03  Concurrency: 4 threads (target='dev')
01:07:03  1 of 17 START sql table model my_schema.dates ...... [RUN]
01:07:03  2 of 17 START sql view model my_schema.table_1  [RUN]
01:07:03  3 of 17 START sql view model my_schema.table_2  [RUN]
01:07:03  4 of 17 START sql view model my_schema.table_3  [RUN]
01:07:16  3 of 17 OK created sql view model my_schema.table_2  [OK in 12.81s]
01:07:16  5 of 17 START sql view model my_schema.table_4  [RUN]
01:07:16  4 of 17 OK created sql view model my_schema.table_3  [OK in 12.92s]
01:07:16  6 of 17 START sql view model my_schema.table_5  [RUN]
01:07:17  2 of 17 OK created sql view model my_schema.table_1  [OK in 13.79s]
01:07:17  7 of 17 START sql view model my_schema.table_6  [RUN]
01:07:24  5 of 17 OK created sql view model my_schema.table_4  [OK in 8.16s]
01:07:24  8 of 17 START sql view model my_schema.table_7  [RUN]
01:07:25  7 of 17 OK created sql view model my_schema.table_6  [OK in 7.83s]
01:07:25  9 of 17 START sql view model my_schema.table_8  [RUN]
01:07:27  6 of 17 OK created sql view model my_schema.table_5  [OK in 11.32s]
01:07:27  10 of 17 START sql view model my_schema.table_9  [RUN]
01:07:33  9 of 17 OK created sql view model my_schema.table_8  [OK in 8.56s]
01:07:33  11 of 17 START sql table model my_schema.table_10 .. [RUN]
01:07:34  10 of 17 OK created sql view model my_schema.table_9  [OK in 6.26s]
01:07:34  12 of 17 START sql view model my_schema.table_11  [RUN]
01:07:34  8 of 17 OK created sql view model my_schema.table_7  [OK in 9.78s]
01:07:34  13 of 17 START sql table model my_schema.table_12  [RUN]
01:07:41  12 of 17 OK created sql view model my_schema.table_11  [OK in 7.78s]
01:07:41  14 of 17 START sql view model my_schema.table_13  [RUN]
01:07:45  1 of 17 OK created sql table model my_schema.dates . [OK in 41.68s]
01:07:46  11 of 17 OK created sql table model my_schema.table_10  [OK in 12.81s]
01:07:49  14 of 17 OK created sql view model my_schema.table_13  [OK in 8.01s]
01:07:51  13 of 17 OK created sql table model my_schema.table_12  [OK in 17.67s]
01:07:51  15 of 17 START sql table model my_schema.orphans ...... [RUN]
01:07:51  16 of 17 START sql table model my_schema.sources_combined  [RUN]
01:08:15  15 of 17 OK created sql table model my_schema.orphans . [OK in 23.43s]
01:08:15  17 of 17 START test accepted_values  [RUN]
01:08:17  17 of 17 PASS accepted_values  [PASS in 1.83s]
01:09:22  16 of 17 OK created sql table model my_schema.sources_combined  [OK in 90.90s]
01:09:23
01:09:23  Finished running 11 view models, 5 table models, 1 test in 0 hours 8 minutes and 50.20 seconds (530.20s).
01:09:23
01:09:23  Completed successfully
01:09:23
01:09:23  Done. PASS=17 WARN=0 ERROR=0 SKIP=0 TOTAL=17

And the 29th when the error started showing up:

01:00:31  Running with dbt=1.4.9
01:00:31  Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:03:06
01:03:06  Finished running  in 0 hours 2 minutes and 34.68 seconds (154.68s).
01:03:06  Encountered an error:
Runtime Error
  HTTPSConnectionPool(host='adb-xxxxxxx.x.azuredatabricks.net', port=443): Max retries exceeded with url: /sql/1.0/warehouses/xxxxxxxx (Caused by ResponseError('too many 503 error responses'))

System info:

Hope this helps shed some more light on the issue.