Retries should retry on known retriable database errors in addition to connection errors

ClickHouse / dbt-clickhouse

The Clickhouse plugin for dbt (data build tool)

Apache License 2.0

246 stars 108 forks source link

Retries should retry on known retriable database errors in addition to connection errors #165

Open andaag opened 1 year ago

andaag commented 1 year ago

Describe the bug

Based on this description:

retries: [1] # Number of times to retry a "retriable" database exception (such as a 503 'Service Unavailable' error)

I was kinda expecting retries to also retry failures in executing queries. Such as:

DB::Exception: Timeout: Cannot enqueue query on this replica, most likely because replica is busy with previous queue entries.

But looking at the code it looks like retries is set on retry_connection from SQLConnectionManager, which retries on connection failures, not on query failures.

Is this intended? Am I understanding it wrong? This might be a bit of a mix between either a bug report or a feature request, depending on what the intended behavior is! 😀

simpl1g commented 1 year ago

But how do you know if some part of your data is already inserted or not? Making retries in a database that does not have transactions is pretty dangerous, you can end up with duplicates. We had a lot of problems until we disabled retries in code. So our rule of thumb when we write code - all tasks should be idempotent, if cancel task in the middle and rerun it multiple times we should have correct data

genzgd commented 1 year ago

You understand it correctly. As you noted, the documentation should be more explicit in that retries only refers to connection failures in cases where it's guaranteed that a retry will not insert duplicates. If ClickHouse returns a "retriable" HTTP status code (in particular, 429, 502, or 503) the query will be retried up to the "retries" configuration value.

As @simpl1g suggested, we don't retry database errors. In theory we could -- the difficulty is in determining what ClickHouse errors are retriable. First, ClickHouse doesn't currently return a separate error code -- it has to be parsed from the content of the error message. Second, there are something like 1000 ClickHouse error codes and there's no definitive list of what is "retriable" and what is not.

That said, this is a potential feature request -- which is probably most appropriate for the underlying drivers (clickhouse-connect and clickhouse-driver).

andaag commented 1 year ago

But how do you know if some part of your data is already inserted or not? Making retries in a database that does not have transactions is pretty dangerous, you can end up with duplicates. We had a lot of problems until we disabled retries in code.

By specifically retrying error codes where we know it hasn't been applied. We do this with other services today, and based on the description in the readme that's what I assumed this project did too!

First, ClickHouse doesn't currently return a separate error code -- it has to be parsed from the content of the error message. Second, there are something like 1000 ClickHouse error codes and there's no definitive list of what is "retriable" and what is not.

Probably makes sense to do this on a driver level, and have a whitelist of error codes that are known to be retriable. 👍

stephen-up commented 6 days ago

Hi, I also think more errors should be retried by DBT.

Im getting ClickHouse exception: The read operation timed out at the end of our 120 model pipeline that was running for 5 hours. Since there is no per statement retry its restarting the whole thing.