Retry failed models due to connection or server error

databricks / dbt-databricks

A dbt adapter for Databricks.

https://databricks.com

Apache License 2.0


Retry failed models due to connection or server error #767

Closed septponf closed 1 month ago

septponf commented 3 months ago

If a model fails due to intermittent failures not related to the model itself, it would be nice to have an auto retry. For example, during the summer we have had some scheduled model failures due to "Remote end closed connection without response" or "Query could not be scheduled: HTTP Response code: 503. Please try again later. SQLSTATE: XX000".

For context we are executing DBT as a Databricks job using the DBT Task and SQL Serverless for compute.

For reference, I believe the bigquery adapter has such features.

benc-db commented 3 months ago

In order to get that message, it generally has already retried for 15 minutes. What did you have in mind?

septponf commented 3 months ago

> In order to get that message, it generally has already retried for 15 minutes. What did you have in mind?

Ok. Well, looking at the model execution timings, it does not look like any retries were attempted. See the example log below.


...
03:35:29  Running with dbt=1.7.17
03:35:31  Registered adapter: databricks=1.7.10
...
03:42:45  134 of 158 START sql table model xx  [RUN]
03:42:46  134 of 158 ERROR creating sql table model xx  [ERROR in 1.09s]
...
03:43:47  Finished running 99 view models, 59 table models, 1 hook in 0 hours 6 minutes and 47.45 seconds (407.45s).
03:43:47  
03:43:47  Completed with 1 error and 0 warnings:
03:43:47  
03:43:47    Runtime Error in model xx (models/path/to/xx.sql)
  Runtime Error
    ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
03:43:47

benc-db commented 3 months ago

Hmm, this is very strange, as it indicates the connection was actively broken; I don't think we have retry in that circumstance, but we can file a bug against databricks-sql-connector to remedy that. Does this happen often? If so, I would file a ticket with Databricks to understand why you're getting disconnected.

septponf commented 3 months ago

Happened a couple of times in July, and tonight actually. There is no entry in the query history, so I am assuming the query does not even get provisioned in the SQL engine.

It has been the same model each time which is strange. It has very simple logic so I am thinking it could be due to execution timing. Maybe it is due to parallelism. I see in the dbt log that 7 models are starting at the same time (same second). We run dbt in 7 threads on a 6 DBU serverless cluster.

Could be that the request sometimes is not queued properly or that we are exceeding some API rate limit? I will file a support ticket to investigate this further.

Nevertheless, an auto retry would be nice :-). I am considering adding a dbt retry task after dbt run in case it fails.
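As a rough illustration of that workaround, here is a minimal sketch of a wrapper that runs `dbt run` and, if it exits non-zero, follows up with `dbt retry` (available in dbt-core 1.6 and later, which re-runs from the point of failure of the previous invocation). The function name `run_with_retry`, the `max_retries` default, and the injectable `runner` are all hypothetical, not part of dbt or dbt-databricks. Note that it retries on any failure, including genuine model errors.

```python
import subprocess
from typing import Callable, Optional, Sequence


def run_with_retry(
    run_cmd: Sequence[str] = ("dbt", "run"),
    retry_cmd: Sequence[str] = ("dbt", "retry"),
    max_retries: int = 2,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> int:
    """Run dbt once, then call `dbt retry` up to max_retries times on failure.

    Returns the exit code of the last command that was executed.
    """
    if runner is None:
        # Default: shell out to the dbt CLI and report its exit code.
        def runner(cmd: Sequence[str]) -> int:
            return subprocess.run(list(cmd)).returncode

    code = runner(run_cmd)
    attempts = 0
    while code != 0 and attempts < max_retries:
        attempts += 1
        # `dbt retry` picks up from the failed nodes of the previous invocation.
        code = runner(retry_cmd)
    return code
```

Calling `run_with_retry()` from a job entry point and passing its return value to `sys.exit` would keep the job red only if every attempt failed.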

benc-db commented 3 months ago

Would you mind filing against https://github.com/databricks/databricks-sql-python? Basically explain that we don't retry when we get 'Remote end closed connection without response', but that it should be safe to do so? In that package we aim to retry safe commands, i.e. ones that either are idempotent or that we know the server didn't receive, but in this case we have evidence that getting this response means the server didn't receive or otherwise that no action was taken. I will also take into consideration some version of model retry, but do not have capacity to explore right now.
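To make the idea concrete, here is a minimal sketch of retrying a request only when the failure is a dropped connection, on the assumption stated above that such a failure means the server took no action. This is not the connector's actual retry logic; `call_with_retry`, `attempts`, and `base_delay` are hypothetical names, and it catches the standard-library `http.client.RemoteDisconnected` rather than whatever wrapper exception the HTTP client raises in practice.

```python
import time
from http.client import RemoteDisconnected
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(), retrying with exponential backoff on RemoteDisconnected.

    Any other exception propagates immediately, since it may mean the
    server received the request and retrying might not be safe.
    """
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except RemoteDisconnected:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error.
            # Back off 1x, 2x, 4x, ... base_delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Restricting the `except` clause to the disconnect case is the point of the design: a blanket retry could re-submit a non-idempotent statement that the server had already executed.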

septponf commented 2 months ago

Ok. I filed a new issue. https://github.com/databricks/databricks-sql-python/issues/433

septponf commented 2 months ago

I created a ticket with Microsoft to check if anything was going on server side that would cut the connection. They consulted with the Databricks engineering team, and they found that it occurs when connections are reused after being idle for more than 180 seconds.

A max idle change that addresses the problem is included in dbt-databricks version 1.7.14. We were using 1.7.10 at the time.
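For anyone curious what a max idle guard looks like in principle, here is a minimal sketch, not the actual dbt-databricks change: a holder that reopens the connection when it has sat unused for longer than a threshold. The class name `IdleAwareConnection`, the `connect` factory, and the 150-second default (picked arbitrarily to sit below the 180 seconds mentioned above) are all assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional


class IdleAwareConnection:
    """Hands out a cached connection, reopening it after too much idle time."""

    def __init__(
        self,
        connect: Callable[[], Any],
        max_idle_seconds: float = 150.0,  # hypothetical value, below 180 s
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._conn: Optional[Any] = None
        self._last_used = 0.0

    def get(self) -> Any:
        now = self._clock()
        idle_too_long = now - self._last_used > self._max_idle
        if self._conn is None or idle_too_long:
            # Close the stale connection if it exposes close(), then reopen.
            close = getattr(self._conn, "close", None)
            if callable(close):
                close()
            self._conn = self._connect()
        self._last_used = now
        return self._conn
```

The guard avoids ever sending a request over a connection the server may already have dropped, rather than detecting the failure and retrying afterwards.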

NodeJSmith commented 1 month ago

I can validate that we have not seen this issue since pinning to dbt-databricks == 1.8.5

cyberjar09 commented 1 month ago

also, I have been using 1.7.16 and ~~its been good! 👍~~ the situation improved but still seeing the issue crop up from time to time 😞

@benc-db you may need to reopen this

bolinzzz commented 1 month ago

we bumped dbt-databricks to 1.7.16 and it did not get rid of this issue.