Intermittent Unknown error 258 with no obvious cause

deadwards90 commented 2 years ago

Describe the bug

On occasions we will see the following error

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)

However, SQL Server shows no long running queries and is not using a lot of it's resources during the periods where this happens.

It looks to be more of an intermittent connection issue but we're unable to find any sort of root cause.

To reproduce

We're not sure of the reproduction steps. I've been unable to reproduce this myself by simulating load. From what we can tell this is more likely to happen when the pod is busy (not through just HTTP, but handling events from an external source) but equally it can happen randomly when nothing is really happening on the pod which has caused us quite a substantial amount of confusion.

Expected behavior

Either more information on what the cause might be, or some solution to the issue. I realise the driver might not actually know the issue and it may really be a timeout to it's point of view. We're not entirely sure where the problem lies yet, which is the biggest issue.

Further technical details

Microsoft.Data.SqlClient version: 3.0.1 .NET target: Core 3.1 SQL Server version: Microsoft SQL Azure (RTM) - 12.0.2000.8 Operating system: Docker Container - mcr.microsoft.com/dotnet/aspnet:3.1

Additional context

Running in AKS, against Elastic Pools.
SQL Server shows no long running queries
We sometimes get a TimeoutEvent from the metrics that are collected from the pool. On occasions when we do get them, the error_state will be different.
- For example, we had one this morning that was 145. We don't know what this means can find no information on what these relate to. I've raised a ticket with the Azure Docs team to look at this. I'll add more onto this when they happen as we've not been keeping track of the error_state codes as we're not sure if they're even relevant.
This might be related to this ticket - https://github.com/dotnet/SqlClient/issues/647
- However we don't see the ReadSniSyncOverAsync
We do have event counter metrics being exported to Prometheus but have found no obvious indicators that something is wrong

sajidur commented 11 months ago

I encountered a similar issue where my application was running on Docker within a Red Hat VM. I faced several weeks of challenges while trying to resolve it. Eventually, I decided to switch to a Windows environment, and to my surprise, I encountered no further issues.

I suspect that there might have been a network issue within my Linux VM with the windows SQL Server database.

SharePointX commented 11 months ago

I am having the same problem.

System.Data.SqlClient version: 4.8.5
.NET target: .NET 8.0
SQL Server version: Microsoft SQL Server 2019 Developer (64-bit) - 15.0.2101.7
Operating system: Windows Server 2019 Datacenter (10.0)

System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): 待ち操作がタイムアウトになりました。
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite, String method)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.FillInternal(DataSet dataset, DataTable[] datatables, Int32 startRecord, Int32 maxRecords, String srcTable, IDbCommand command, CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.Fill(DataTable[] dataTables, Int32 startRecord, Int32 maxRecords, IDbCommand command, CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.Fill(DataTable dataTable)

The table has 10 columns. Total number of data is 22,000,000. When I call "Fill" with PK by SqlCommand, the error occurs only the first time. On the second call, the correct 10,000 items are retrieved.

It occurs in single-threaded console applications, even when accessed from a single instance. I believe the problem is not a client application problem, but a database cold start problem. Is there any way to warm up the database?

JRahnama commented 11 months ago

@SharePointX have you tried Microsoft.Data.SqlClient? Is this constantly happen on your application? Can you share a sample repro?

SharePointX commented 11 months ago

I swapped to Microsoft.Data.SqlClient to check and still got the same error. However, it does not occur on the second run.

1st run: timeout error
2nd time: Normal termination immediately
After restarting SQL Server service: takes a long time but ends normally

The other thing I found out is that the error occurs only when querying a table with a very large number of records in the same DB instance.

DavoudEshtehari commented 11 months ago

@SharePointX have you tried increasing the connection timeout to make sure it won't be timed out while retrieving data from the server and how it affects the execution?

I'd suggest retrieving data in smaller batches if it fails with large portions to find the balance. If the server is not available temporarily (especially over the network), I'd recommend doing the retry pattern on the the application level with the remaining data.

There are some thread starvation scenarios with thread pool and connection pool, which I believe don't apply to your case with a single-threaded app. I don't have any idea about warming up the database, you may ask the SQL Server Dev team.

oyvost commented 10 months ago

I revisited this problem and stumbled upon something intriguing while using EF Core 7.0. I discovered that both DbConnection and DbCommand have a CommandTimeout property.

Here's how I configured the database service:

builder.Services
    .AddDbContext<DBContext>(options =>
    {
        var connectionString = "----";
        options.UseSqlServer(connectionString, sqlServerOptionsAction: sqlOption =>
        {
            sqlOption
                .CommandTimeout((int)TimeSpan.FromMinutes(30).TotalSeconds);
        });
    });

When calling the stored procedure, I had to include the following code:

_db.Database.OpenConnection();
var command = _db.Database.GetDbConnection().CreateCommand();
command.CommandText = "StoredProc";
// command.CommandTimeout = _db.Database.GetCommandTimeout().Value; // The command timeout is not set without this line
return await command.ExecuteNonQueryAsync();

I initially assumed that DBCommand.CommandTimeout was inherited from DBConnection.CommandTimeout, but that's not the case. The line mentioned above must be uncommented. After doing so, the timeout issue (258 timeout) is resolved. This behavior was somewhat unexpected. Is there a better way to handle this configuration?

ErikEJ commented 10 months ago

@oyvost You can set it in the connection string.

L-Palmquist commented 10 months ago

To continue the conversation about this issue being related to Linux and to hopefully narrow in on an event ... our Azure Application Insight logs show that these random timeouts AND application performance issues all started after this Azure East US maintenance window QK0S-TC8 did SOMETHING to Azure functions host on August 17th 9:00PM EST. Something in that host update caused this unknown error 258 to start appearing.

At that point in time, our host went from Linux 5.10.177.1-1.cm1 to Linux 5.15.116.1-1.cm2 and application performance tanked shortly after, and we now have the sudden appearance of these Unkown error 258 throwing exceptions. Some metric digging shows that memory usage (P3V3 with 32GBs) tanked along with performance.

No code or db changes on our part, just suddenly the AppInsight logs show kernel 5.15, odd timeout errors, and we presumably can't access memory. @eisenwinter - you said you tried a few different kernels, did you go back to 5.10? Anyone else seeing this issue on 5.10 or lower?

Update: We converted the function app over to Windows. The sql timeout errors are gone and application performance is restored! 🎉

Just want to mention that we experienced a similar issue and converting to Windows seems to have solved the problem.

dazinator commented 10 months ago

I can't switch to windows to solve this. I added polly retries with exponential backoff and on the third retry when this error occurs, the query succeeds. This has allowed us to remain operational thank goodness.

dazinator commented 9 months ago

Would it be possible to hear from the SqlClient team as to whether any fixes in the latest release (5.1.2) may help with this issue? https://github.com/dotnet/SqlClient/blob/main/release-notes/5.1/5.1.2.md

dbeavon commented 8 months ago

If you have chronic problems or outages, please open a support ticket with Microsoft and share the outcome here. I think that type of information will allow us to proceed more productively.

I'm a "big-data" developer in Azure, and I have suffered from chronic socket exceptions in multiple Azure platforms - just as everyone else is describing. The main pattern I've noticed is that they happen within Microsoft's proprietary VNET components (private endpoints). They are particularly severe in multi-tenant environments where several customers can be hosted on the same hardware (or even in containers within the same VM.

The problems are related to bugs in Microsofts software-defined-networking components (SDN proxies like "private endpoints" or "managed private endpoints"). I will typically experience these SDN bugs in a large "wave" that impacts me for an hour and then goes away. The problems have been trending for the worse over the past year, and I've opened many tickets (Power BI, ADF, and Synapse Spark).

The problems are not likely to happen on normal TCP (public internet ports). Nor are they common on VNET traffic when "peering" is used. It seems to be a problem that is specific to the funky private endpoints in Azure. I would be interested to know if others can related to this pattern? Also, the socket exceptions are not specific to SQL. They impact MSAL, ADLS, REST, etc. The underlying TCP exceptions are generally the same in all cases ("connection reset" or "connection timeout").

The Microsoft support cases I have already opened have not gone well. We are a small customer and Microsoft has not taken our problems seriously. I think they are already well-aware of the bugs in their private VNET technology and are trying to keep them hidden. They undermine the severity of these problems by repeating the word "transient" over and over like it was some sort of a magical incantation that makes the bugs more acceptable. To me it seems totally unacceptable to have issues (transient or otherwise) that trigger dozens and dozens of socket exceptions a day.

I find that there are lots of Microsoft's own products that are impacted by these network bugs, notably ADF pipelines. They have had a "VNET IR" that suffers from a massive number of socket exceptions when interacting with resources in the local network (East US). The support teams actually say they have a formalized "policy" to require customers to implement retries on all pipeline activities (they suggest at least three retries with two minute delays in between them). This "policy" is simply a way to avoid the well-known bugs. ADF deployed the VNET IR several years ago and they have probably handled hundreds of CSS tickets about the fragile private endpoints in their managed vnet.

IMPORTANT: Based on a three-year support case with another team (not ADF) I do have bug numbers that apply to private endpoints and these bugs are supposed to be fixed soon (not holding my breath).

The following changes are for the "Azure Network stack":

Feature ending with x509 : Support map space deletion in VFP without breaking flows. (virtual filtering platform, ie switch?)

The ETA publicized is late 2024

Bug ending with x810 : PE-NC Flow mix up Network team promised to roll it out early 2024.

Sorry for the long post. I think it is easy for everyone to prove that these problems are specific to Azure, and don't affect on-premise scenarios with Sql Client. The challenge is getting Microsoft to help with them in Azure. I get the sense that there is a substantial lack of honesty and transparency when it comes to networking bugs in Azure. There was only one support scenario where Microsoft was willing to admit to the bugs I've experienced in their SDN technology. In every other Azure platform they are refusing to investigate or discuss the so-called "transient" network errors that cause TCP connections to break between resources that are in direct proximity to each other (0 ms latency).

Here is a discussion of ADF's networking bugs. https://learn.microsoft.com/en-us/answers/questions/1390627/what-is-the-reason-for-networking-problems-in-adf

Hope this helps. I would appreciate it if those impacted by the socket errors would at least confirm that their scenarios involve VNETS, eg. private endpoints in particular, and possibly managed vnets. If you open a support case with Microsoft, please keep a count of the number of times they use the word "transient" (and they always say "communication failure" rather than socket exception).

jbogard commented 8 months ago

If you have chronic problems or outages, please open a support ticket with Microsoft and share the outcome here. I think that type of information will allow us to proceed more productively.

We did this. It was punted around between teams with no resolution so we closed it on our side because we were tired of spending time answering the same questions over and over. The workarounds posted here have helped but not eliminated the issue. We decided to stick with *nix App Service instances although the time spent futzing around on this issue has cost us more than the savings switching from Windows instances.

dazinator commented 8 months ago

The main pattern I've noticed is that they happen within Microsoft's proprietary VNET components (private endpoints). They are particularly severe in multi-tenant environments where several customers can be hosted on the same hardware (or even in containers within the same VM.

@dbeavon We have a multi-tenant application running as a docker container on an Azure VM - connecting to Azure SQL over a "private endpoint" (which presents itself as a nic on our vnet). So yes I think this is a good match.

I have not tried deleting the private endpoint and going via the public ip, i will give that a shot. As it stands, we are mid plan refactoring our application to make our most impacted SQL calls in a newly dedicated process (theory being it was something to do with thread pool ramp up when other things are sharing the thread pool - and this not playing well with the was sql client library is written using sync over async and such). So by giving our task a dedicated dotnet process hopefully that will help. However if it doesn't, aside from being really annoyed, atleast I can give the public IP a go..

It's too painful to go through support, we are a small team and the level of time required to get this over the line appears too much.

dbeavon commented 8 months ago

@dazinator I certainly agree about the pain of support. But if Microsoft isn't working on these sorts of cases, then they will have less motivation to fix the underlying bugs . I doubt there is any telemetry that is captured for the so-called "transient" socket exceptions in their virtualized network technology. If only 1% of the impacted customers are making the effort to report these problems then Microsoft probably thinks the problems aren't very severe.

And if Microsoft won't prioritize or fix these network bugs, they will impact us forever. Just as a case in point - the managed-vnet-IR in ADF has been buggy for ~4 years. And the managed vnet gateway in Power BI is also impacted by the same private endpoint bugs for ~3 years (the only difference here is that the Power BI team wisely allowed their component to remain in a "preview" state rather than pushing ahead with a premature GA). I have had a case open with the PBI team for the entire time this component has been in preview.

The biggest pain for me has been with the Synapse platform. Hosting my Spark workloads on Synapse has been a real nightmare. I see dozens and dozens of socket exceptions each day, in every possible type of remote connection. However this did NOT happen to me when my Spark workloads were hosted on Azure Databricks (using an injected VNETs, rather than private endpoints). Nor do I see these socket exceptions when running a standalone Spark cluster on my own local hardware. It is pretty easy to distinguish the network environments that work fine from the ones that are bug-ridden. There is no excuse for the unreliable nature of Microsoft VNETs. The TCP transport protocol is already designed to be fault-tolerant. Microsoft has really changed the whole equation when it comes to the reliability of TCP connections....

It is especially annoying to think about the fact that these failed connections are taking place between servers that are sitting a hundred yards from each other (or less). ... I will have to find the Azure East US region on google earth to see how big it is. It is certainly not big enough to suffer from "random Internet glitches". I would have better network reliability running my Spark clusters on the Wifi at an airport than I would in Azure!

TroyWitthoeft commented 8 months ago

@dbeavon - I know how painful and impossible it can be to track down transient network errors. I bet the MS team does too, hence the run around. Transient errors stink, I feel your pain.

I do want to point out that this particular situation could or could NOT be related to network errors. We experienced that the Intermittent Unknown error 258 began with an update to our Linux App Service that clearly changed it's operating telemetry and then this error appeared.

We'll all need to work together to get this one sorted.

sliekens commented 8 months ago

It doesn't help that 258 seems to be used for a wide range of problems. Your query is slow? Unknown error 258. The network is congested? Unknown error 258. The DBA didn't like the way you looked at him? UNKNOWN ERROR 258.

Vandersteen commented 8 months ago

The main pattern I've noticed is that they happen within Microsoft's proprietary VNET components (private endpoints). They are particularly severe in multi-tenant environments where several customers can be hosted on the same hardware (or even in containers within the same VM.

@dbeavon We have a multi-tenant application running as a docker container on an Azure VM - connecting to Azure SQL over a "private endpoint" (which presents itself as a nic on our vnet). So yes I think this is a good match.

I have not tried deleting the private endpoint and going via the public ip, i will give that a shot. As it stands, we are mid plan refactoring our application to make our most impacted SQL calls in a newly dedicated process (theory being it was something to do with thread pool ramp up when other things are sharing the thread pool - and this not playing well with the was sql client library is written using sync over async and such). So by giving our task a dedicated dotnet process hopefully that will help. However if it doesn't, aside from being really annoyed, atleast I can give the public IP a go..

It's too painful to go through support, we are a small team and the level of time required to get this over the line appears too much.

We went the other way, from public IP to private endpoint, if my memory serves me, it was after some advice we received from azure support when we asked to investigate it. It did not help much. What did seem to help (a bit, we still get frequent timeouts / 258), is to have less applications / concurrency on a single 'vm' by making our 'apps' reserve more cpu's (on k8s) even thought they do not really need it when looking at the cpu usage.

dbeavon commented 8 months ago

It would be very easy to send a repro to one of these Azure teams, but they are always unwilling to pursue that option. I have a couple tickets open at the moment where I offered to give the CSS engineers a repro in order to experience these problems for themselves, and they are totally uninterested in taking that path.

(It is strange to behold a vendor that doesn't care about fixing their bugs. When using on-prem software, vendors are typically VERY appreciative if you could provide a consistent reproducible that allowed them to independently review a bug on demand.)

In my experience this problem isn't specific to SQL. In my Synapse-Spark clusters I have chronic socket failures across a number of remotely accessed services in East-US (MSAL, REST, SQL, etc). I just finished up with a support case involving the ADF team whereby they were experiencing a similar problem in their own component (ADF "Linked Service Resolver"). Their component was very unreliable as it was making connections to MSAL from the managed vnet. This particular MS team (that is responsible for the ADF "Linked Service Resolver") says the solution is that every TCP socket which is opened in Azure needs to be wrapped in retry loops that are repeated five times with two minute delays between them. I'm not sure about you folks but it seems absurd. The TCP protocol is supposed to be resilient and fault tolerant, so that application developers don't need to re-invent the wheel and wrap a ton of custom retry logic around every remote TCP connection.

In the end, the ADF team "mitigated" their own component ("Linked Service Resolver") in some way - presumably by using a custom-written retry loop. Then they closed the case without giving any explanation for the elevated number of socket exceptions that were affecting us in the first place. I had hoped to learn from the experience and apply the information to the other types of failing sockets, but this ADF team really didn't give me anything useful as a take-away. Aside from my chronic socket exceptions in ADF/Synapse, I also have frequent outages that are hitting my service-health dashboard. The outages last thirty minutes at a time and come up on a monthly basis, and these outage notifications often use the word "transient communication failures" to explain the RCA. In other words, I think Microsoft product teams are suffering from the network problems themselves, even to the point of publicizing them when they become extreme. Hopefully they will eventually fix these problems for their own sake, (if not for ours.)

Will-Bill commented 7 months ago

For those moving to PostgreSQL to mitigate these issues, what are you using for the migration from MS SQL to PostgreSQL?

Malcolm-Stewart commented 7 months ago

Are you running into the Azure Load Balancer issue, where if the connection goes idle for more than 5 seconds, then the connection is terminated? Normal SQL keep-alive is 30 seconds. Lowering it to 3 seconds avoids the issue by ensuring there is never a gap of 5-seconds without some sort of traffic, even if it is a Keep-Alive packet.

A cold SQL Server may be prone to having periodic pauses in getting/returning data, whereas a warmed-up server should have it in cache. This may explain the issues you are running into.

You should be able to see this using a network trace.

asierpn commented 7 months ago

The issue can be easily reproduced by running an aspnet core web api project in a Linux container and making multiple concurrent calls to an endpoint to query a SQL Server database immediately after the service starts. The cost of opening each new thread on Linux appears to be very high, and making 30 concurrent queries with no available threads in the thread pool can take more than 60 seconds. If you do the same test running the service on Windows, it works perfectly.

The only workaround for us to get this to work on Linux is to set the ThreadPool.MinThreads to a high number.

damodar12345 commented 6 months ago

@asierpn @dazinator @MichelZ Could you please let me know what was the value of min threads you had set as a workaround?

asierpn commented 6 months ago

@asierpn @dazinator @MichelZ Could you please let me know what was the value of min threads you had set as a workaround?

It depends on your application, you have to test with different values to find the best one for you, we noticed an improvement when we set it to 20, but eventually we increased it to 100.

deadwards90 commented 6 months ago

@Malcolm-Stewart I'll try to answer some of your questions, I'm going from memory here as I raised this issue some time ago.

Are you running into the Azure Load Balancer issue, where if the connection goes idle for more than 5 seconds, then the connection is terminated? Normal SQL keep-alive is 30 seconds. Lowering it to 3 seconds avoids the issue by ensuring there is never a gap of 5-seconds without some sort of traffic, even if it is a Keep-Alive packet.

I don't believe we were going through an Azure Load Balancer for this. Given what others in this thread have suggested, I don't think this would be what I was seeing but it might be what others have experienced (see also the port exhaustion suggestions from early on).

A cold SQL Server may be prone to having periodic pauses in getting/returning data, whereas a warmed-up server should have it in cache. This may explain the issues you are running into.

I can say we some confidence this was not the issue. These errors happened sporadically, on hot or cold servers.

You should be able to see this using a network trace.

Certainly would be nice but the issue we always had was that these were running on containers in K8s and the errors were not easily reproduced. We'd have to leave network traces running for hours or days at a time to try and catch one in the act. There is @asierpn above who recently said this can be fairly easily produced just by running in a Linux Container and querying (I assume?) any SQL Server instance just as the service is starting

With that said, we witnessed this on both busy and fairly dormant services that had been up and running for days at a time so it might be the same error message but a different root cause, very hard to say though.

If there is anything we (the community) could do here to help nail this down, I'm all ears. Happy to have something running on a machine over a weekend to try and catch it along with some tracing but I'm not sure the best way to set this up.

David-Engel commented 6 months ago

Just want to clarify one misconception.

The cost of opening each new thread on Linux appears to be very high, and making 30 concurrent queries with no available threads in the thread pool can take more than 60 seconds.

While the cost of creating new threads is higher on Linux than on Windows, for example. It's not that cost that slows things down. The system thread pool implementation in .NET is designed to spin up new threads slowly, in a very conservative manner. If a library or application is spawning many tasks that hold threads for any length of time, it will take a while for the system thread pool to add threads to service the waiting queue of tasks.

DerPeit commented 3 months ago

In our project, we also discovered that issue quite often, recently. After weeks of debugging, I was able to deterministically reproduce this issue. I am not quite sure if this is the cause for any of the incidents mentioned above, though.

In our case the issue occurs with Entity Framework when the following criteria is met:

A long-running query is being executed
The query will not finish within the configured command timeout
Iteration will be stopped, prematurely:

void Main()
{
    const string longRunningQuery =
        """
        WITH TenRows AS (
            SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
        ),

        ThousandRows AS (
            SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
        )

        SELECT A.A
        FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
        """;

    var context = new TestContext();

    context.Database.SetCommandTimeout(1);

    var values = context.Database.SqlQueryRaw<int>(longRunningQuery);

    foreach (var value in values)
    {
        break;
    }
}

// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
    }
}

The reason is that the underlying data reader is being disposed (closed) when the enumerator is being disposed. This does not cancel the underlying db command. See also: https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.sqlclient.sqldatareader.close?view=sqlclient-dotnet-standard-5.2#remarks

Interestingly, querying the data to completion may take longer than the configured command timeout in total. But as soon as the reader is being closed the timeout seems to kick in. In the example above, the instruction pointer will "hang" on the closing foreach bracket for the amount of time configured.

The code above can be rewritten to something which is similar to:

const string longRunningQuery =
    """
    WITH TenRows AS (
        SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
    ),

    ThousandRows AS (
        SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
    )

    SELECT A.A
    FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
    """;

var connection = new SqlConnection("Server=localhost; Integrated Security=True");
var command = connection.CreateCommand();

command.CommandText = longRunningQuery;
command.CommandTimeout = 1; // seconds

connection.Open();

var reader = command.ExecuteReader();

// Closing/Disposing the reader will drain all pending records. If the time this takes,
// starting from the close attempt, will exceed the command timeout, error 258 will be thrown.
// If the associated command would have been canceled, the code would work fine:
//
//command.Cancel();
//
// Remarks: If we read the reader from user code regularly, the timeout won't have any
// effect. At least when the first record is being delivered in time:
//
//while (reader.Read());

reader.Close();

Thus, in my opinion, the issue is related with https://github.com/dotnet/efcore/issues/24857. Although the underlying issue is ADO.NET related, OR mappers like EF Core IMO must address this somehow, as it lacks an API to actually cancel the query.

Our scenarios, for example, are a little more complicated. We attempt to stream out data from the database over HTTP to a client while performing some transformation. When this kind of transformation fails for some reason, this leads to the error and even hides the original exception due to the underlying finally block:

async Task Main()
{
    await StreamToClient(CancellationToken.None);
}

async Task StreamToClient(CancellationToken cancellationToken)
{
    const string longRunningQuery =
        """
        WITH TenRows AS (
            SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
        ),

        ThousandRows AS (
            SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
        )

        SELECT A.A
        FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
        """;

    var context = new TestContext();

    context.Database.SetCommandTimeout(1);

    var records = context.Database.SqlQueryRaw<int>(longRunningQuery).AsAsyncEnumerable();

    await foreach (var record in records.WithCancellation(cancellationToken))
    {
        var transformedRecord = await Transform(record, cancellationToken);

        // Send to client
    }
}

async Task<string> Transform(int record, CancellationToken cancellationToken)
{
    await Task.Yield();

    throw new InvalidOperationException();
}

// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
    }
}

Even cancellation triggers the error, although it is internally bound to cancelling the db command eventually:

async Task Main()
{
    using var cts = new CancellationTokenSource();

    cts.CancelAfter(TimeSpan.FromSeconds(2));

    await StreamToClient(cts.Token);
}

async Task StreamToClient(CancellationToken cancellationToken)
{
    const string longRunningQuery =
        """
        WITH TenRows AS (
            SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
        ),

        ThousandRows AS (
            SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
        )

        SELECT A.A
        FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
        """;

    var context = new TestContext();

    context.Database.SetCommandTimeout(1);

    var records = context.Database.SqlQueryRaw<int>(longRunningQuery).AsAsyncEnumerable();

    await foreach (var record in records.WithCancellation(cancellationToken))
    {
        var transformedRecord = await Transform(record, cancellationToken);

        // Send to client
    }
}

async Task<string> Transform(int record, CancellationToken cancellationToken)
{
    await Task.Yield();

    return record.ToString();
}

// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
    }
}

If we add an interceptor cancelling the command on reader close, the issue goes away, immediately:

void Main()
{
    const string longRunningQuery =
        """
        WITH TenRows AS (
            SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
        ),

        ThousandRows AS (
            SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
        )

        SELECT A.A
        FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
        """;

    var context = new TestContext();

    context.Database.SetCommandTimeout(1);

    var values = context.Database.SqlQueryRaw<int>(longRunningQuery);

    foreach (var value in values)
    {
        break;
    }
}

// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
        optionsBuilder.AddInterceptors(new CommandCancelingInterceptor());
    }
}

public class CommandCancelingInterceptor : DbCommandInterceptor
{
    public override InterceptionResult DataReaderClosing(DbCommand command, DataReaderClosingEventData eventData, InterceptionResult result)
    {
        command.Cancel();

        return base.DataReaderClosing(command, eventData, result);
    }
}

I saw some people using Dapper. We're using Dapper in our project as well. However, I did not yet attempt to reproduce the issue with Dapper. I am pretty sure, though, that any OR mapper is affected that behaves in the same way.

I hope this helps some of you guys and I'd be interested if the other things you tried might correlate with our observation. We also tried to scale up the database or adjust some screws and it helped. But I think this was due to the query we were executing always "scratched" on the default command timeout of 30 seconds, exceeding it only occasionally. And scaling up the database lead to reader disposal finishing in time.

deadwards90 commented 3 months ago

@DerPeit glad you got to the bottom of your variant of the issue.

I hope this helps some of you guys and I'd be interested if the other things you tried might correlate with our observation

On this point, I think what you saw and what we saw were different issues. We never saw these queries make their way to the database, or at least not that we could find. The queries themselves would not have been long-running (e.g. selecting a single record by ID) and they could happen when the DB was using less than 10% DTUs.

dotnet / SqlClient