Closed: OlofL closed this issue 8 months ago
Thanks for the report, @OlofL. Could you elaborate a bit on what kind of workload you're running? Is your model generating a regular table, streaming table, materialized view, or something else?
Of course. It is our standard daily build: it starts by creating table models and seeds, then 88 different snapshots, and then continues with lots of incremental models, a few table models, and finally a number of views.
We have two workspaces: one uses Serverless, the other uses Pro SQL Warehouses. The problem only occurred in the Pro workspace. After switching the other workspace from Serverless to Pro, the problem occurred there as well.
No streaming tables, just a plain batch transformation.
job.txt Attached is the job definition yml with the sensitive data removed. Pinning the dbt-databricks library to a version below 1.7.3 is the workaround we have found works.
The combination of things working when switching to serverless or 1.7.3, as well as the failures giving up after < 3 minutes suggests to me that something has changed in our retry policy in databricks-sql-connector 3.0.1 (which we upgraded to with 1.7.4). Going to work with @susodapop to try to figure out what is different.
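As a rough illustration of why the failures give up in under three minutes, here is a simplified model of an exponential-backoff retry budget. This is my own sketch, not the actual urllib3 or connector code, and the numbers are illustrative only:

```python
# Simplified model (not the exact urllib3 implementation) of how long an
# exponential-backoff retry budget waits before giving up entirely.
def total_backoff(retries: int, backoff_factor: float = 1.0,
                  backoff_max: float = 120.0) -> float:
    """Sum of sleep times across all retry attempts."""
    total = 0.0
    for attempt in range(1, retries + 1):
        # sleep grows as factor * 2**(attempt - 1), capped at backoff_max
        total += min(backoff_max, backoff_factor * 2 ** (attempt - 1))
    return total

# With 5 retries and factor 1.0 the client waits ~31s in total --
# far less than the several minutes a cold SQL Warehouse can take to start.
print(total_backoff(5))       # 31.0
print(total_backoff(5, 0.5))  # 15.5
```

If the connector's defaults shrank between versions, a warehouse that is still starting would exhaust the budget and surface the 503s instead of waiting them out.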
Great! Since we are able to recreate the problem in both our environments I'm happy to help with any further testing if possible.
@OlofL if you can get your dbt artifacts to me from one of the failed runs (either file a ticket with your Databricks contact, or you can email it to me at ben.cassell@databricks.com), that would be very helpful.
I can get it from our second environment, so it is in your mailbox now
@OlofL do you see this issue at all with 1.7.3? I'm able to repro the failure this morning with 1.7.3 :/.
Great that you can replicate it. We have only seen it in 1.7.4, not in 1.7.3. But to be on the safe side we have pinned the setting to <1.7.3, so we are not running version 1.7.3 in our daily build; we are running an earlier version. I can try again tomorrow with version 1.7.3 if that helps to make sure it is the same problem.
We are also experiencing the same issue with 1.7.4, but not with 1.7.3. 1.7.4 works OK if the SQL Warehouse is already running, but if the workflow has to start the SQL Warehouse, that is when we encounter the 503 error.
Adding retries to the task is another workaround for us at this point.
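For reference, a sketch of what that task-level retry looks like in the job YAML. The field names follow the Databricks Jobs API; the task layout and values here are illustrative, not our exact job definition:

```yaml
# Illustrative only -- field names from the Databricks Jobs API,
# values are assumptions, not our actual job.
tasks:
  - task_key: dbt_build
    max_retries: 2                     # rerun the task if it fails
    min_retry_interval_millis: 300000  # wait 5 min so the warehouse can finish starting
    retry_on_timeout: false
    dbt_task:
      commands:
        - dbt build
```

The retry interval matters here: a short interval can retry while the warehouse is still starting and hit the same 503s again.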
Even after upgrading to 1.7.5, I'm still facing the same bug as described in this issue. I get the exact same error message as above. 1.7.3 is working fine for me.
Does 1.7.5 resolve the issue for you @OlofL @andrewpdillon ?
My full package list:
Package Version
------------------------- ----------
agate 1.7.1
annotated-types 0.6.0
appdirs 1.4.4
argcomplete 3.1.6
attrs 23.2.0
Babel 2.14.0
cachetools 5.3.2
certifi 2023.11.17
cffi 1.16.0
cfgv 3.4.0
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
commitizen 3.12.0
cryptography 42.0.2
databricks-sdk 0.17.0
databricks-sql-connector 3.0.2
dbt-core 1.7.6
dbt-databricks 1.7.5
dbt-extractor 0.5.1
dbt-semantic-interfaces 0.4.3
dbt-spark 1.7.1
decli 0.6.1
diff_cover 8.0.3
distlib 0.3.8
et-xmlfile 1.1.0
filelock 3.13.1
google-auth 2.27.0
identify 2.5.33
idna 3.6
importlib-metadata 6.11.0
iniconfig 2.0.0
isodate 0.6.1
jaraco.classes 3.3.0
jeepney 0.8.0
Jinja2 3.1.3
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
keyring 24.3.0
leather 0.3.4
Logbook 1.5.3
lz4 4.3.3
MarkupSafe 2.1.4
mashumaro 3.12
minimal-snowplow-tracker 0.0.2
more-itertools 10.2.0
msgpack 1.0.7
networkx 3.2.1
nodeenv 1.8.0
numpy 1.26.3
oauthlib 3.2.2
openpyxl 3.1.2
packaging 23.2
pandas 2.1.4
parsedatetime 2.6
pathspec 0.11.2
pip 23.3.2
platformdirs 4.2.0
pluggy 1.4.0
pre-commit 2.18.1
prompt-toolkit 3.0.36
protobuf 4.25.2
pyarrow 14.0.2
pyasn1 0.5.1
pyasn1-modules 0.3.0
pycparser 2.21
pydantic 2.6.0
pydantic_core 2.16.1
Pygments 2.17.2
pytest 8.0.0
python-dateutil 2.8.2
python-slugify 8.0.2
pytimeparse 1.1.8
pytz 2023.4
PyYAML 6.0.1
questionary 2.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.31.0
rpds-py 0.17.1
rsa 4.9
SecretStorage 3.3.3
setuptools 69.0.3
six 1.16.0
sqlfluff 2.3.5
sqlparams 6.0.1
sqlparse 0.4.4
tblib 3.0.0
termcolor 2.4.0
text-unidecode 1.3
thrift 0.16.0
toml 0.10.2
tomlkit 0.12.3
tqdm 4.66.1
typing_extensions 4.9.0
tzdata 2023.4
urllib3 1.26.18
virtualenv 20.25.0
wcwidth 0.2.13
wheel 0.42.0
zipp 3.17.0
Hi all, we had to pull 1.7.5 as well because it didn't completely fix the problem. If you'd like to help validate our latest attempt, can you try installing from the 1.7.4testing branch?
@OlofL @andrewpdillon @henlue please let me know if the above (1.7.4testing branch) resolves the issue.
@OlofL @andrewpdillon @henlue any update? I'm looking for feedback that this branch resolves the issue before I release it, as this circumstance is very hard to repro internally.
I configured a new workflow with 1.7.4 to first verify that I still had the error but I didn't get it this time. Is the error dependent on a specific version of dbt? I will try again.
No, whether you see the error or not depends on how quickly compute is available, which is why reproducing is so hard; it is a function of the cloud provider's compute availability.
Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.
Yeah, for me, I have the best luck reproducing at 4:30 pm PST, and I have no idea why lol.
That is when we in Europe start all our major workloads, at 01.30 CET.
I have the same issue but with an older version of dbt: 1.4.9. In my case, I had a scheduled job that was working fine up to the 28th of November last year. Then it suddenly started failing the next day, without any changes from our side (development was on hold).
Here's the log for the 28th execution:
01:00:32 Running with dbt=1.4.9
01:00:32 Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:07:03
01:07:03 Concurrency: 4 threads (target='dev')
01:07:03 1 of 17 START sql table model my_schema.dates ...... [RUN]
01:07:03 2 of 17 START sql view model my_schema.table_1 [RUN]
01:07:03 3 of 17 START sql view model my_schema.table_2 [RUN]
01:07:03 4 of 17 START sql view model my_schema.table_3 [RUN]
01:07:16 3 of 17 OK created sql view model my_schema.table_2 [OK in 12.81s]
01:07:16 5 of 17 START sql view model my_schema.table_4 [RUN]
01:07:16 4 of 17 OK created sql view model my_schema.table_3 [OK in 12.92s]
01:07:16 6 of 17 START sql view model my_schema.table_5 [RUN]
01:07:17 2 of 17 OK created sql view model my_schema.table_1 [OK in 13.79s]
01:07:17 7 of 17 START sql view model my_schema.table_6 [RUN]
01:07:24 5 of 17 OK created sql view model my_schema.table_4 [OK in 8.16s]
01:07:24 8 of 17 START sql view model my_schema.table_7 [RUN]
01:07:25 7 of 17 OK created sql view model my_schema.table_6 [OK in 7.83s]
01:07:25 9 of 17 START sql view model my_schema.table_8 [RUN]
01:07:27 6 of 17 OK created sql view model my_schema.table_5 [OK in 11.32s]
01:07:27 10 of 17 START sql view model my_schema.table_9 [RUN]
01:07:33 9 of 17 OK created sql view model my_schema.table_8 [OK in 8.56s]
01:07:33 11 of 17 START sql table model my_schema.table_10 .. [RUN]
01:07:34 10 of 17 OK created sql view model my_schema.table_9 [OK in 6.26s]
01:07:34 12 of 17 START sql view model my_schema.table_11 [RUN]
01:07:34 8 of 17 OK created sql view model my_schema.table_7 [OK in 9.78s]
01:07:34 13 of 17 START sql table model my_schema.talbe_12 [RUN]
01:07:41 12 of 17 OK created sql view model my_schema.table_11 [OK in 7.78s]
01:07:41 14 of 17 START sql view model my_schema.table_13 [RUN]
01:07:45 1 of 17 OK created sql table model my_schema.dates . [OK in 41.68s]
01:07:46 11 of 17 OK created sql table model my_schema.table_10 [OK in 12.81s]
01:07:49 14 of 17 OK created sql view model my_schema.table_13 [OK in 8.01s]
01:07:51 13 of 17 OK created sql table model my_schema.talbe_12 [OK in 17.67s]
01:07:51 15 of 17 START sql table model my_schema.orphans ...... [RUN]
01:07:51 16 of 17 START sql table model my_schema.sources_combined [RUN]
01:08:15 15 of 17 OK created sql table model my_schema.orphans . [OK in 23.43s]
01:08:15 17 of 17 START test accepted_values [RUN]
01:08:17 17 of 17 PASS accepted_values [PASS in 1.83s]
01:09:22 16 of 17 OK created sql table model my_schema.sources_combined [OK in 90.90s]
01:09:23
01:09:23 Finished running 11 view models, 5 table models, 1 test in 0 hours 8 minutes and 50.20 seconds (530.20s).
01:09:23
01:09:23 Completed successfully
01:09:23
01:09:23 Done. PASS=17 WARN=0 ERROR=0 SKIP=0 TOTAL=17
And the 29th when the error started showing up:
01:00:31 Running with dbt=1.4.9
01:00:31 Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:03:06
01:03:06 Finished running in 0 hours 2 minutes and 34.68 seconds (154.68s).
01:03:06 Encountered an error:
Runtime Error
HTTPSConnectionPool(host='adb-xxxxxxx.x.azuredatabricks.net', port=443): Max retries exceeded with url: /sql/1.0/warehouses/xxxxxxxx (Caused by ResponseError('too many 503 error responses'))
System info:
Hope this helps shed some more light on the issue.
Describe the bug
A dbt workflow job terminates with the error Max retries exceeded with url: ... (Caused by ResponseError('too many 503 error responses')) just as it starts sending SQL commands to the cluster. This occurs only with connector version 1.7.4, not 1.7.3, and only with a Pro SQL Warehouse, not with SQL Warehouse Serverless.
Steps To Reproduce
The problem occurs in a workflow in a Databricks workspace with the following settings:
- Running on Azure, Databricks Premium, not Unity Catalog
- Job cluster: single node Standard_DS3_v2
- Work cluster: SQL Warehouse Pro X-Small, Cluster count: Active 0, Min 1, Max 1, Channel Current, Cost optimized
- Git source: Azure DevOps
- Library version setting: dbt-databricks>=1.0.0,<2.0.0
Start the workflow. After the job cluster has been created and the SQL Warehouse has been started, an error is shown in the log:
02:12:11
02:12:11 Finished running in 0 hours 2 minutes and 32.34 seconds (152.34s).
02:12:11 Encountered an error:
Runtime Error
HTTPSConnectionPool(host='[removed]', port=443): Max retries exceeded with url: /sql/1.0/warehouses/[removed] (Caused by ResponseError('too many 503 error responses'))
Changing to a SQL Warehouse of type Serverless, X-Small, Cluster count: Active 0, Min 1, Max 1, Channel Current solves the problem. Pinning to an older version of the dbt-databricks library (dbt-databricks>=1.0.0,<1.7.3) also solves the problem. The problem appeared on Jan 25 when the version was updated from 1.7.3 to 1.7.4. I have not been able to reproduce it when running dbt interactively, only as a workflow, but that may be my mistake; these have been a stressful couple of days because of this.
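For anyone triaging this: the error text itself comes from urllib3's retry machinery. A minimal, hypothetical repro of just the error class (not the dbt or connector code path) shows where the message originates. The Fake503 class is a stand-in I made up for illustration:

```python
# Hypothetical minimal repro of the error class only: urllib3's Retry
# produces exactly this message when every attempt keeps returning
# HTTP 503 until the retry budget is exhausted.
from urllib3.util.retry import Retry
from urllib3.exceptions import MaxRetryError

class Fake503:
    """Stand-in for a response object; Retry.increment only reads these."""
    status = 503
    def get_redirect_location(self):
        return False

retry = Retry(total=3, status_forcelist=[503])
err = None
try:
    while True:  # each 503 consumes one retry until the budget runs out
        retry = retry.increment(method="GET",
                                url="/sql/1.0/warehouses/xxxx",
                                response=Fake503())
except MaxRetryError as e:
    err = e

print(err)
# Max retries exceeded with url: /sql/1.0/warehouses/xxxx
# (Caused by ResponseError('too many 503 error responses'))
```

So the question is which layer (dbt-databricks, databricks-sql-connector, or urllib3 defaults) shrank the retry budget between 1.7.3 and 1.7.4.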
Expected behavior
Log output:
02:08:27 Running with dbt=1.7.5
02:08:28 Registered adapter: databricks=1.7.3
02:08:28 Unable to do partial parsing because saved manifest not found. Starting full parse.
02:08:46 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 919 macros, 0 groups, 0 semantic models
02:08:46
02:14:30 Concurrency: 12 threads (target='prod')
02:14:30
02:14:31 1 of 1835 START sql table model staging.rollup12helper ......................... [RUN]
02:14:31 2 of 1835 START sql table model staging.rollup24helper ......................... [RUN]
Screenshots and log output
N/A
System information
dbt 1.7.5
dbt-databricks 1.7.4
Configured in a dbt workflow, not much control over settings there.
Additional context
N/A