.NET 5.0 | Implement SqlException.IsTransient

roji commented 4 years ago

https://github.com/dotnet/runtime/issues/34817 introduces DbException.IsTransient for .NET 5.0, database-agnostic transient error detection. Note that this is orthogonal to SqlClient's own resilient connection mechanism - IsTransient is only about surfacing error transience to external retry strategies/policies such as Polly.

Note that EF Core's SQL Server provider has a transient exception detector, this could be a starting point for implementing this within SqlClient. The goal would eventually be for EF Core to rely on the new IsTransient property rather than contain specific knowledge on SqlClient error codes.

KalleOlaviNiemitalo commented 4 years ago

Please consider adding some of these that EF Core's SqlServerTransientExceptionDetector does not recognize:

Number	Severity	Message	Reasoning
601	12	"Could not continue scan with NOLOCK due to data movement."	Advice in Hints (Transact-SQL) - Table
617	20	"Descriptor for object ID %ld in database ID %d not found in the hash table during attempt to unhash it. A work table is missing an entry. Rerun the query. If a cursor is involved, close and reopen the cursor."	"Rerun"
669	22	"The row object is inconsistent. Please rerun the query."	"Rerun"
921	14	"Database '%.*ls' has not been recovered yet. Wait and try again."	"Try again"
1203	20	"Process ID %d attempted to unlock a resource it does not own: %.*ls. Retry the transaction, because this error may be caused by a timing condition. If the problem persists, contact the database administrator."	"Retry"
1204	19	"The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions."	"Rerun"
1221	20	"The Database Engine is attempting to release a group of locks that are not currently held by the transaction. Retry the transaction. If the problem persists, contact your support provider."	"Retry"
1222	16	"Lock request time out period exceeded."	"Time out"
3935	16	"A FILESTREAM transaction context could not be initialized. This might be caused by a resource shortage. Retry the operation. Error code: 0x%x."	"Retry"
3960	16	"Snapshot isolation transaction aborted due to update conflict. You cannot use snapshot isolation to access table '%.ls' directly or indirectly in database '%.ls' to update, delete, or insert the row that has been modified or deleted by another transaction. Retry the transaction or change the isolation level for the update/delete statement."	"Retry"
3966	17	"Transaction is rolled back when accessing version store. It was earlier marked as victim when the version store was shrunk due to insufficient space in tempdb. This transaction was marked as a victim earlier because it may need the row version(s) that have already been removed to make space in tempdb. Retry the transaction"	"Retry"
8628	17	"A time out occurred while waiting to optimize the query. Rerun the query."	"Rerun"
8645	17	"A timeout occurred while waiting for memory resources to execute the query in resource pool '%ls' (%ld). Rerun the query."	"Rerun"
8651	17	"Could not perform the operation because the requested memory grant was not available in resource pool '%ls' (%ld). Rerun the query, reduce the query load, or check resource governor configuration setting."	"Rerun"
9515	16	"An XML schema has been altered or dropped, and the query plan is no longer valid. Please rerun the query batch."	"Rerun"
10922	16	"%ls failed. Rerun the statement."	"Rerun"
14355	16	"The MSSQLServerADHelper service is busy. Retry this operation later."	"Retry"
17197	16	"Login failed due to timeout; the connection has been closed. This error may indicate heavy server load. Reduce the load on the server and retry login.%.*ls"	"Retry"
20041	16	"Transaction rolled back. Could not execute trigger. Retry your transaction."	"Retry"

Also, if SqlClient does not recognize the specific error code, perhaps it can still treat some of the Database Engine Error Severities as transient; like 13 (deadlock) or 17 (out of resources).

KalleOlaviNiemitalo commented 4 years ago

EF Core's SqlServerTransientExceptionDetector returns true if it finds at least one transient error in the SqlException.Errors collection. Should SqlException.IsTransient instead treat some errors as always permanent, so that it returns false if the collection contains at least one always-permanent error, regardless of whether the collection also contains transient errors? Severity 15 (T-SQL syntax error) could be a candidate for that.

roji commented 4 years ago

@KalleOlaviNiemitalo that's great, thanks for this info! Your suggestion below also makes sense to me - if there's any non-transient error in the SqlException, that means there's no use to retry since that error will happen again, so we may as well return false from IsTransient.

/cc @ajcvickers for the error codes missing in the EF Core SQL Server provider.

KalleOlaviNiemitalo commented 4 years ago

To be clear, I do not suggest returning false from SqlException.IsTransient if SqlException.Errors contains two errors, of which one is recognized as transient and the other is not recognized at all. The suggestion was for errors that SqlClient specifically recognizes as always-permanent.

For example, if an SqlException contains the following two errors, then SqlException.IsTransient would be true, because there is at least one transient error and no always-permanent errors:

921 recognized as transient "Database '%.*ls' has not been recovered yet. Wait and try again."
4060 not recognized "Cannot open database "%.*ls" requested by the login. The login failed."

roji commented 4 years ago

The suggestion was for errors that SqlClient specifically recognizes as always-permanent.

I tend to see it the other way - if there's a chance that an error could be indicating a transient problem, then that's a good reason for surfacing it as transient. From this discussion on MySQL:

Which specific errors are classified as transient is of course a provider decision. However, the way I think about it is to err on the side of false positives, rather than conservatively erring on the side of false negatives. A false positive would typically mean a needless retry (which doesn't seem critical), whereas a false negative would mean throwing where a retry may have result in success.

See also the note in the IsTransient proposal about the property being "optimistic".

So I'd say, if there's a single error in the exception which is known to not be transient by nature, the entire exception should report IsTransient=false; otherwise, it should return true. A pragmatic approach may make sense here - we should keep in mind that the use of this is to trigger retrying by an upper layer.

/cc @bgrainger @fransbouma @ajcvickers

KalleOlaviNiemitalo commented 4 years ago

So I'd say, if there's a single error in the exception which is known to not be transient by nature, the entire exception should report IsTransient=false; otherwise, it should return true.

In that case, SqlException won't need a list of transient error numbers at all, as it will only recognize errors that are "known to not be transient by nature".

roji commented 4 years ago

At least for PostgreSQL (which I maintain), the (vast) majority of errors are still non-transient, so it seems to make sense to have a list for errors that are known to be (possibly) transient. That may not be the case for SQL Server, though I'd be surprised.

ajcvickers commented 4 years ago

@AndriySvyryd has traditionally been the steward of which codes we treat as transient.

cheenamalhotra commented 4 years ago

Hi @KalleOlaviNiemitalo @roji

So I'd say, if there's a single error in the exception which is known to not be transient by nature, the entire exception should report IsTransient=false; otherwise, it should return true. A pragmatic approach may make sense here - we should keep in mind that the use of this is to trigger retrying by an upper layer.

I agree with this too!

In the initial phase, the list of transient errors that the driver would recognize as "transient" is driven by Azure's documented error numbers: Transient Fault Error Codes as currently SqlClient retries on.

The error messages documented above are good recommendations, but as client drivers work close to SQL Server/Azure Standards, only errors listed "transient" by server are handled implicitly in SqlClient. ORM frameworks like EF Core posses the flexibility to implement their own retry policy based on experiences and application errors, as it's an external extender.

We will take them into discussion when we work on this activity internally in order to provide best experience to customers, but we also would need to abide by driver policies to reflect server-side transient policies.

Wraith2 commented 4 years ago

It doesn't sound like there is consensus on what a transient exception is and if that's the case adding the feature without a way to configure it sounds like setting ourselves up for years of issues of people saying one thing or another should be transient. To avoid that could we add something at a process level (so at provider level in the api) which lets users add whatever we end up needing to determine if an error is transient? If other providers end up needing the same functionality the implementations could be promoted to a common api in .net 6.

AndriySvyryd commented 4 years ago

I want to sidestep the question of what should be considered a transient error for now to focus on the biggest advantage of this API. In the past SQL Azure has changed the list of transient errors requiring changes to the application, EF Core or SqlClient to change the retry policy accordingly. If SqlClient implements SqlException.IsTransient by delegating to the server (when that's where the error originated) then such behavioral changes won't be breaking anymore.

A related useful API would be a TimeSpan RetryAfter property that provides a baseline for the recommended delay before retrying depending on the error, but this would need to be provided by the server to give the most benefit.

Wraith2 commented 4 years ago

But if we delegate to the server to decide if something is transient how do we know if inability to connect to the server is transient... 😀 More seriously is there something in sql that we can query to get a list of transient exceptions?

roji commented 4 years ago

@Wraith2

It doesn't sound like there is consensus on what a transient exception is and if that's the case adding the feature without a way to configure it sounds like setting ourselves up for years of issues of people saying one thing or another should be transient.

I think that's giving up a bit quickly here; this is definitely something we need to discuss thoroughly in order to arrive at a good definition, but I think that's certainly possible. I don't believe transience is a concept that really varies that much, and pragmatically speaking I think most applications are going to have the same needs around this.

EF Core's transient error detector could be considered a pretty good proof of this; the built-in retrying strategy identifies the codes it identifies, and we've rarely had users arguing for a different concept of transience. What we have seen, as @AndriySvyryd wrote, is changes in actual error codes. Note also that the EF Core detector has been copied and pasted for use outside of EF Core (e.g. https://gist.github.com/hyrmn/ce124e9b1f50dbf9d241390ebc8f6df3#file-sqlservertransientexceptiondetector-cs), which is exactly why I think this kind of knowledge should go into SqlClient.

To avoid that could we add something at a process level (so at provider level in the api) which lets users add whatever we end up needing to determine if an error is transient?

The whole point of this API is for the DB provider to report to an upper layer (Polly, EF Core, LLBLGen Pro) what it considers to be transient for the general, common case. If an application wants to customize which errors are considered transient, it can simply do that at that upper layer, i.e. defining a Polly/EF Core retrying strategy that does whatever the user wants. In other words, I don't think it makes any sense for users to call a SqlClient API which changes what SqlClient reports in IsTransient, when that IsTransient API is itself called by the user (via Polly/EF Core). If you want to control what's considered transient and what isn't, that's totally fine - you can just customize your Polly/EF Core policy and leave SqlClient out of it.

I also very much agree with @AndriySvyryd on the problem of changing error codes in SQL Azure.

Wraith2 commented 4 years ago

Ok,

rcmosher commented 1 year ago

I just want to add it is very frustrating to have an exception saying "An exception has been raised that is likely due to a transient failure.", a documented IsTransient property on SqlException, but then that property always returns false two years after this issue is opened. There should be a way to detect errors reported as transient without reading the exception message. And documentation should indicate the property isn't supported. It looks like the current solution is to copy SqlServerTransientExceptionDetector.ShouldRetryOn

Wraith2 commented 1 year ago

Have you considered trying to add the feature? the codebase is in the repository.

tbolon commented 1 year ago

@rcmosher @Wraith2 I just submitted a PR containing a copy of the most recent EFCore implementation.

Feel free to comment on this PR.

stevendarby commented 6 months ago

The goal would eventually be for EF Core to rely on the new IsTransient property rather than contain specific knowledge on SqlClient error codes.

My concern, based on the status of the current PR, is that it's going to be a lot harder to maintain this list in SqlClient than in EF Core :(

sebok commented 3 months ago

What about false positive results?

This SQL also produces error code 121, but it's a syntax error:

CREATE TABLE #MyTempTable
(
    Col1 INT,
    Col2 VARCHAR(255)
);

INSERT INTO #MyTempTable(Col1, Col2) SELECT Col1, Col2, Col2 AS ColTooMany FROM #MyTempTable ;

I've seen many implementations which contains error code 121, but no one cares about it. The problem is that SQL error codes are conflicting with Win32 error codes.

I think it also affects error codes 64, 121, 233, 10053, 10054 and 10060

Checking the inner Win32Exception can be a solution, but also not perfect in the case of multiple errors.

dotnet / SqlClient

.NET 5.0 | Implement SqlException.IsTransient #649