dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
849 stars 284 forks source link

Execution Timeout Expired Error (258, ReadSniSyncOverAsync) #647

Closed frankyuan closed 5 months ago

frankyuan commented 4 years ago

Describe the bug

When executing SQL such as SELECT FieldA, FieldB FROM A INNER JOIN C ON A.FieldId = C.FieldId UNION SELECT FieldA, FieldD FROM A INNER JOIN D ON A.FieldId = D.FieldId, throw the error like below, not every time, just a little part of queries have this issue.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.\n
---> System.ComponentModel.Win32Exception (258): Unknown error 258\n
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction)\n 
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByteArray(Span1 buff, Int32 len, Int32& totalRead)\n
at Microsoft.Data.SqlClient.TdsParser.TrySkipValue(SqlMetaDataPriv md, Int32 columnOrdinal, TdsParserStateObject stateObj)\n
at Microsoft.Data.SqlClient.TdsParser.TrySkipRow(SqlMetaDataSet columns, Int32 startCol, TdsParserStateObject stateObj)\n
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)\n
at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)\n
at Microsoft.Data.SqlClient.SqlDataReader.ReadAsync(CancellationToken cancellationToken)\n
--- End of stack trace from previous location where exception was thrown ---\n
at Dapper.SqlMapper.QueryAsync[T](IDbConnection cnn, Type effectiveType, CommandDefinition command) in //Dapper/SqlMapper.Async.cs:line 437\n

To reproduce

Sorry, currently can't reproduce in the local environment, so can't provide more detail to reproduce.

Expected behavior

SQL should execute successfully every time.

Further technical details

Microsoft.Data.SqlClient version: 1.1.3 .NET target: Core 3.1 Operating system: Docker container

What I found/tried

https://stackoverflow.com/questions/57270245/sql-server-dbcommand-timeout-with-net-core-container-under-load

https://github.com/StackExchange/Dapper/issues/1435

arielmoraes commented 2 years ago

As a workaround, you can bump up server's max worker threads that will allow other pending tasks to occur and close connections that are to be closed.

@cheenamalhotra what if you are running inside Azure SQL Pool and the Workers count is not hitting the limit?

I managed to reproduce the issue with Mars enabled (another issue), but the Unkown error 258 and some others are happening to our application when running in production too. In the same scenario, when the system is under heavy load exceptions start to happen.

We need a fix for that ASAP as we are running financial services.

JRahnama commented 2 years ago

@arielmoraes are you able to provide a sample repro? MARS enabled is probably issue #422 which is a different story, or if you are able to capture EventSource traces that at least shows us where the issue has happened?

arielmoraes commented 2 years ago

@JRahnama I managed to set a sample repro to get the Events when using MARS, will post the results on the other issue.

For this issue, I'll have to enable the event source in production which will generate A LOT of events, if I manage to create a sample repro I'll post the results here.

bennil commented 2 years ago

In my last post I guessed the 258 error occurs because of full tcp buffers and wrong tcp socket handling on linux (see my previous post above). Further investigation made me more confident that this is the case.

The additional queue I added (last post) helped a bit, but after adding more features (service implementations) to our host process the error came back quite frequently.

Because most 258 errors occured in DbContext.SaveChanges() I synchronized calls in my DbContext base implementation like this:

public class MyBaseDbContext : DbContext
{
    private static readonly object syncObj = new object();

    public override int SaveChanges()
    {
        lock (syncObj)
        {
            return base.SaveChanges();
        }
    }
}

This dirty code slowed down my execution flow and helped reducing the 258 errors. But like others reported here about disabling MARS it did not solve the problem completly.

I examined that the problem occurs often during extensive external web API rest calls. So I synchronized them as well:

static SemaphoreSlim semaphoreSlimGeocodeLock = new SemaphoreSlim(1, 1);

public void MyRestCallMethod() 
{
    try
    {
        await semaphoreSlimGeocodeLock.WaitAsync();

        HttpClient httpClient = new HttpClient();
        string result = await httpClient.GetStringAsync(url);
        ...

    }
    finally
    {
        httpClient.Dispose();
        semaphoreSlimGeocodeLock.Release();
    }
}

No parallel web calls helped a lot and made me sure the error is related to the current tcp buffer usage within the application.

I guess the error originates in some way like this:

  1. SqlClient reads/writes to the sql server tcp socket connection.
  2. Because of the buffer handling bug the read/write loop returns indicating the remotehost has terminated the connection.
  3. SqlClient internal reconnect feature reconnects immediately to the SQL Server.
  4. On the next tcp read/write the same error occures because the application tcp buffer is still full.
  5. This is repeated many times and there might be a threading issue playing a role as well.
  6. The application is freezed and timeout "Unknwon error 258" occurs

Slowing down the execution flow and hoping not reaching tcp buffer limits is not a production ready workaround :) I wonder if this is a common problem for all high tcp i/o apps running on Linux and connecting to Sql Server?

This error occured some time after upgrading ef core (not sure which .net core/ef core version). My application runs currently on:

Used SQL Server: Latest SQL Server 2019 Docker container

Side notes:

JRahnama commented 2 years ago

@bennil and those who are using NetCore, if possible, can you test with MultiSubnetFailover=true?

arielmoraes commented 2 years ago

I know it could be a very long shot, but reading this I can't stop thinking, could that be related to continuations trying to execute in the wrong SynchronizationContext?

bennil commented 2 years ago

@JRahnama we added MultiSubnetFailover=true to our connection string. I don't know if we can reproduce the problem correctly because our last change did further minimize parallel work and the error occured then only once during the day. So we need some days to guess if this option changed anything.

@arielmoraes: Interesting thought. I did search our codebase and found some improper use of Task.Result. Can this affect the synchronous db save method? Or did we just reach a parallel limit using MARS and to many db read/writes worker tasks?

niuniu007 commented 2 years ago

最近观察到业务高峰期会有大量的Unknown error 258,在数据库服务器上观测到cpu 100%,是由 sql server windows nt -64bit 占用 98%+ 没有死锁 使用 select * from master.dbo.sysprocesses 查询到连接数在300左右 当前系统有 170左右 当前系统数据库连接串: Data Source=xxxxxxx,15726;Initial Catalog=xxxxx;Persist Security Info=True;User ID=xxxx;Password=xxxxxx;Integrated Security=false;Connect Timeout=30;MultipleActiveResultSets=false

有两个大并发业务连接到这台数据库 因为一些sql语句 较慢,经过优化后,今天再次使用 select * from master.dbo.sysprocesses 查询当前连接在 60左右,已经没有Unknown error 258 的记录了

同时:数据库的最大连接数设置为0

dazinator commented 2 years ago

Without wanting to reveal too much about our infrastructure setup on a public domain, We see this error pretty consistently when load testing a web api, that inserts a large volume of data into an azure sql database.

Details that might hopefully enable a repo:

  1. Web api is running inside linux docker container (asp.net core 5 image) hosted on ubuntu 20.04 vm (azure) with 2 cpus (general purpose)
  2. Azure SQL (serverless, with scale settings: min 0.5 cpu cores, max 2 cpu cores)
  3. The api receives a POST request with bulk data (5000 records) and does an insert using EF Core 5.0.5 sql server package, single call to SaveChanges().
  4. Our load generator, makes 30 concurrent http requests to this api, each request delivering 5000 records for an insert.
  5. We see that roughly 1 or 2 requests consistently fail with an error each time, and this error shows up in the stack trace:

image

At the time of the error, for example, below at 3:18pm, the cpu and io usage of the sql servelerless resource are both between 70-80% and the CPU usage on the node running the docker container is 100%

image

So I am assuming @bennil may be on to something given this is happening in high load, high concurrency scenarios only for us, where CPU usage is pushed to the limit.

dazinator commented 2 years ago
arielmoraes commented 2 years ago

A few days ago we've found the solution for our case. If you are using Azure Load Balancer + SQL Database (or SQL Pool) in a VNET, check if your configuration for outbound connections is set to use the Azure Private Network, otherwise, you could be running into SNAT Port Exhaustion.

To check if that's the case, go to the Load Balancer resource page, Metrics, and show SNAT Connection Count, filter for failed Statuses.

Another solution is to use more Outbound IPs.

geomorillo commented 2 years ago

In my last post I guessed the 258 error occurs because of full tcp buffers and wrong tcp socket handling on linux (see my previous post above). Further investigation made me more confident that this is the case.

The additional queue I added (last post) helped a bit, but after adding more features (service implementations) to our host process the error came back quite frequently.

Because most 258 errors occured in DbContext.SaveChanges() I synchronized calls in my DbContext base implementation like this:

public class MyBaseDbContext : DbContext
{
    private static readonly object syncObj = new object();

    public override int SaveChanges()
    {
        lock (syncObj)
        {
            return base.SaveChanges();
        }
    }
}

This dirty code slowed down my execution flow and helped reducing the 258 errors. But like others reported here about disabling MARS it did not solve the problem completly.

I examined that the problem occurs often during extensive external web API rest calls. So I synchronized them as well:

static SemaphoreSlim semaphoreSlimGeocodeLock = new SemaphoreSlim(1, 1);

public void MyRestCallMethod() 
{
    try
    {
        await semaphoreSlimGeocodeLock.WaitAsync();

        HttpClient httpClient = new HttpClient();
        string result = await httpClient.GetStringAsync(url);
        ...

    }
    finally
    {
        httpClient.Dispose();
        semaphoreSlimGeocodeLock.Release();
    }
}

No parallel web calls helped a lot and made me sure the error is related to the current tcp buffer usage within the application.

I guess the error originates in some way like this:

  1. SqlClient reads/writes to the sql server tcp socket connection.
  2. Because of the buffer handling bug the read/write loop returns indicating the remotehost has terminated the connection.
  3. SqlClient internal reconnect feature reconnects immediately to the SQL Server.
  4. On the next tcp read/write the same error occures because the application tcp buffer is still full.
  5. This is repeated many times and there might be a threading issue playing a role as well.
  6. The application is freezed and timeout "Unknwon error 258" occurs

Slowing down the execution flow and hoping not reaching tcp buffer limits is not a production ready workaround :) I wonder if this is a common problem for all high tcp i/o apps running on Linux and connecting to Sql Server?

This error occured some time after upgrading ef core (not sure which .net core/ef core version). My application runs currently on:

  • Bare metal server: AMD Ryzen 7 2700X
  • Ubuntu 18.04.2 LTS
  • .NET 5.0
  • Microsoft.EntityFrameworkCore.SqlServer 5.0.1

Used SQL Server: Latest SQL Server 2019 Docker container

Side notes:

  • Upgrading from a two years old SQL Server 2017 Container to the latest 2019 version did not change anything.
  • I did try collecting traces with dotnet-trace collect --process-id 27455 --profile database but the results did not include the expected events.

My suggestion is create a singleton service serving only one instance of HttpClient.

oyvost commented 2 years ago

258 timeout is a common exception when the DTU limit is reached on Azure SQL. If on Azure, you can try to monitor the Max DTU percentage and see if it hits the limit.

dazinator commented 2 years ago

A few days ago we've found the solution for our case. If you are using Azure Load Balancer + SQL Database (or SQL Pool) in a VNET, check if your configuration for outbound connections is set to use the Azure Private Network, otherwise, you could be running into SNAT Port Exhaustion.

To check if that's the case, go to the Load Balancer resource page, Metrics, and show SNAT Connection Count, filter for failed Statuses.

Another solution is to use more Outbound IPs.

We were originally connecting from our az vnet to the az sql server over public connection so we looked into what you said about SNAT. We didn't have a standard load balancer either or a NAT gateway on our vnet, only az gateway for ingress.

So this week I did some things.

  1. Added NAT gateway, 1 outbound public ip.
  2. Re-ran Load test - same error occurred.
  3. Disabled public connectivity and configured private link / private endpoint instead.
  4. Re-ran Load tests - same error is occurring.

I beleive this rules out SNAT Port Exhaustion as the cause of our issue.

Also we only see this when multiple threads each trying to save large amount of records (5000) to the same tables. If we reduce the volume of records being inserted per concurrent thread, and number of threads, we don't see the error.

DTU does not appear to be reaching near the limit. We have 'serverless' configuration and CPU does spike as expected for high load.

I guess this could just be a genuine timeout due to sql server not being able to insert these records to the same table. Perhaps there is locking or some such. I'm moving down the path that this is just a genuine timeout due to the server being busy under high Load scenario. In this case I am going to try just increasing the command timeout to see if that makes a difference. I will also try to diagnose the inserts on the sql server side to see if it is hitting some bottleneck there.

dazinator commented 2 years ago

Just an update to say, increasing the sql command timeout "solved" this issue. Not sure why the sql commands are so slow under this load test, they insert records into a table, I'm assuming it's related to azure sql dB performance but not sure where to check.. leaving that to one side for now.

MarvinNorway commented 2 years ago

I'm intermittently seeing the same issue with a self-hosted SQL Server Instance.

Microsoft SQL Server 2016 (SP1) (KB3182545) - 13.0.4001.0 (X64) Oct 28 2016 18:17:30 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows Server 2016 Standard 6.3 <X64> (Build 14393: ) (Hypervisor)

The client side uses .NET 4.8 with Entity Framework Core 3.1.5 (Code First) on Windows 10 Enterprise 21H2 19044.1766.

My code is extremely simple:

private void Insert(IList<T> data)
{
    using (MyDbContext dbContext = MyDbContextFactory.CreateDbContext())
    {
        dbContext.AddRange(data);
        dbContext.SaveChanges();
    }
}

Even when I try to insert only 50 rows (with some child rows in another table), I see the error mentioned here:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.

System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out

at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlInternalConnection.cs:line 820  
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 1572  
at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1070  
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 2744  
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 2258  
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 2182  
at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1222  
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1450  
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 2190  
at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 2088  
at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 9384  
at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlInternalConnectionTds.cs:line 1311  
at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlInternalConnectionTds.cs:line 1091  
at Microsoft.Data.SqlClient.SqlInternalTransaction.Commit() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\sqlinternaltransaction.cs:line 344  
at Microsoft.Data.SqlClient.SqlTransaction.Commit() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlTransaction.cs:line 202  
at Microsoft.EntityFrameworkCore.Storage.RelationalTransaction.Commit()  
at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.Execute(IEnumerable`1 commandBatches, IRelationalConnection connection)  
at Microsoft.EntityFrameworkCore.Storage.RelationalDatabase.SaveChanges(IList`1 entries)  
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChanges(IList`1 entriesToSave)  
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChanges(DbContext _, Boolean acceptAllChangesOnSuccess)  
at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)  
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChanges(Boolean acceptAllChangesOnSuccess)  
at Microsoft.EntityFrameworkCore.DbContext.SaveChanges(Boolean acceptAllChangesOnSuccess)  
at Microsoft.EntityFrameworkCore.DbContext.SaveChanges()

It has the same NativeErrorCode=258.

I have increased the timeout to as much as 1200 seconds (20 minutes), and I still get timeouts - much earlier than the timeout I configured (after 90 seconds). Here's how the DbContext is created:

    static class MyDbContextFactory
    {
        public static MyDbContext CreateDbContext()
        {
            DbContextOptions<MyDbContext> options = new DbContextOptionsBuilder<MyDbContext>()
                                                                .UseSqlServer(GlobalConfiguration.GetConnectionString(), ConfigureSqlServerOptions)
                                                                .Options;

            return new MyDbContext(options);
        }

        private static void ConfigureSqlServerOptions(SqlServerDbContextOptionsBuilder opts)
        {
            opts.EnableRetryOnFailure();

            int? dbCommandTimeout = GlobalConfiguration.DbCommandTimeout;
            if (dbCommandTimeout.HasValue)
            {
                opts.CommandTimeout(dbCommandTimeout.Value);
            }
        }
    }

And the context itself is very simple, too:

sealed class MyDbContext : DbContext
{
    public MyDbContext(DbContextOptions<SharePointExportContext> options) : base(options)
    {
    }

    // [...]
}

Any idea what could be causing this?

ErikEJ commented 2 years ago

@MarvinNorway which Microsoft.Data.SqlClient version - and have you tried 4.1?

MarvinNorway commented 2 years ago

@ErikEJ I see that both Microsoft.Data.SqlClient and Microsoft.Data.SqlClient.SNI are 1.1.1. Seems like it's ancient. I'll try with a newer version...

MarvinNorway commented 2 years ago

@ErikEJ Upgrading Microsoft.Data.SqlClient to 4.1.0 and Microsoft.Data.SqlClient.SNI to 4.0.0 solved the problem... Thanks. They were buried among a couple other packages that I can't upgrade, that's why I didn't see that they were so outdated...

shahid-aman-maersk commented 2 years ago

@MarvinNorway we're using 4.1.0 of SqlClient and Functions V4, deployed to Linux app service , still get this error - what kind of app is yours? Microsoft.Data.SqlClient.SqlException: at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParser.TryRun (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParser.Run (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.RunParser (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsyncContinuedOnSuccess (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsyncContinued (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServerInternalRestContinuedAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServerInternalRestAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServerInternalAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.WriteRowSourceToServerAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServer (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)

gamblen commented 2 years ago

I am having this issue intermittently with a .net 6.0 app running in containers on Azure Kubernetes Service accessing an on prem SQL Server,

I am using Micrsooft.Data.SqlClient.dll v5.0.0.

The Microsoft.Data.SqlClient.SNI.dll is only in the directory runtimes\win-*\native so I assume it isn't going to get loaded

I have checked the SNAT Connection Count where the State is Failed and there are none.

We are genuinely considering either changing the hosting off linux hosts or more likely, swapping to a different DB tech as this is such a problem.

jkauppinen commented 2 years ago

Experienced a flavor of this issue. When inserting singular records in intervals of a few seconds in concurrently running applications. Each application operating on their own instance of DbContext and no concurrent use inside their scope.

Packages Microsoft.Data.SqlClient 5.0.1 Microsoft.EntityFrameworkCore.SqlServer 6.0.10

Exception

An exception occurred in the database while saving changes for context type '"XXX.XXXX.XXXX.XXXContext"'.
Microsoft.EntityFrameworkCore.DbUpdateException: An error occurred while saving the entity changes. See the inner exception for details.
 ---> 
 Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. 
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
  at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__207_0(Task`1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)---
    End of stack trace from previous location --- at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
    --- End of stack trace from previous location --- at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.ReaderModificationCommandBatch.ExecuteAsync(IRelationalConnection connection, CancellationToken cancellationToken)ClientConnectionId:554eb334-af3c-400d-92df-1c1c1dd92d2bError Number:-2,State:0,Class:11 --- End of inner exception stack trace --- at Microsoft.EntityFrameworkCore.Update.ReaderModificationCommandBatch.ExecuteAsync(IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IList`1 entriesToSave, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(StateManager stateManager, Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass33_0`2.<<ExecuteAsync>b__0>d.MoveNext()--- End of stack trace from previous location --- at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)

Client .NET 6.0 application implemented as BackgroundService.

Error occured on

 _XContext.X.Add(x);
await _XContext.SaveChangesAsync(cancellationToken);

Last EF generated SQL command before this exception was thrown

SET NOCOUNT ON; 
INSERT INTO [X] ([A], [B], [C], [D], [E], [F], [G]) 
VALUES (@p0, @p1, @p2, @p3, @p4, @p5, @p6); 

SELECT [Z], [K] FROM [X] 
WHERE @@ROWCOUNT = 1 AND [Z] = scope_identity()

Database SQL Server 2016

Resolution for now Optimizing expensive read queries as much as possible that affected the table. After optimizing heavier read queries, I haven't seen this error.

I had several applications concurrently using the database for one expensive read query each on startup, and then making inserts on same tables (not same records). Each had their own scoped instance of DbContext. The issue seemed to come when inserts were occurring to a table that was under a heavy read query by another process. Small percentage of inserts resulted in this error. Enabling retries didn't seem to help.

lcheunglci commented 2 years ago

This sounds like a thread starvation issue with the system thread pool. This is known issue when running asynchronously with Microsoft.Data.SqlClient on .NET Core and it's in our backlog to make it fully asynchronous which will require a big refactor. You might be able to work around this by setting the ThreadPool.SetMinThreads to a high enough value to accommodate, but with many parallel tasks, there'll be more overhead to allow for continuation tasks inside of SqlClient to run before the timeouts are hit.

dazinator commented 1 year ago

@lcheunglci - do you know of a good way to monitor the thread pool to determine if this is indeed thread starvation causing the issue?

  1. We only see this with large queries. Either inserting lots of data or pulling lots of data.
  2. Azure SQL, no where near max DTU, there is no SNAT exhaustion.
  3. Easier for us to reproduce with our load test that does many concurrent REST API calls each inserting lots of data,, but is also happening now in prod when seamingly there wasn't much concurrency, just one EF core large query fired quickly after another in seperate request (a user was paging through a grid in the UI)

Due to number 3) we would be surprised if the thread pool was starved as from our server logs not much was happening apart from one user using a grid at this time. We'd like to be able verify that though.

dazinator commented 1 year ago

@lcheunglci Apologies, the error I am seeing is a 258, but it's not the "ReadSniSyncOverAsync" variety. It's actually this one here: https://github.com/dotnet/SqlClient/issues/1530 will move my comments there

bennil commented 1 year ago

When will this major years old linux issue be fixed? Using MS-SQL Docker on Linux fells like MySQL in the nineties.

sajidur commented 1 year ago

My platform Infrastructure: Docker in Redhat Enterprise Language & Framework: .Net 6 with Entity Framework core Database: SQL Server

After optimizing my API and implementing 'async' and 'await' with 'Task', all the errors have disappeared. Also i set maximum and minimum pooling and connection life cycle

EvgenyGrishnov commented 1 year ago

Could you clarify what do you mean here: "implementing 'asyn'c and 'await' with 'Task'" ? Was your API synchronized when that error appeared?

sajidur commented 1 year ago

Could you clarify what do you mean here: "implementing 'asyn'c and 'await' with 'Task'" ? Was your API synchronized when that error appeared?

just i make my api asyncronized. Yes it was synchronized.

tobyreid commented 1 year ago

I had a very similar problem recently:

In my case I have a dockerized webapi where I start a new TransactionScope whenever a PUT/POST/PATCH/DELETE call is received. Depending on the operation, one or more connections are open to different Azure SQL databases via EF Core (and ultimately the SqlClient) - queries to the first connection will work, but any queries to other connections won't as I'm not using the Azure DTC preview.

Not creating a TransactionScope solved my immediate issue in these instances..

The exception was the same @frankyuan's error and appeared intermittently in logs, due to application usage and behaviour:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.\n
---> System.ComponentModel.Win32Exception (258): Unknown error 258\n

....which was not a super helpful error message - while trying to figure this out 👎

svap-roshan commented 1 year ago

Describe the bug

When executing SQL such as SELECT FieldA, FieldB FROM A INNER JOIN C ON A.FieldId = C.FieldId UNION SELECT FieldA, FieldD FROM A INNER JOIN D ON A.FieldId = D.FieldId, throw the error like below, not every time, just a little part of queries have this issue.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.\n
---> System.ComponentModel.Win32Exception (258): Unknown error 258\n
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction)\n 
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()\n 
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByteArray(Span1 buff, Int32 len, Int32& totalRead)\n
at Microsoft.Data.SqlClient.TdsParser.TrySkipValue(SqlMetaDataPriv md, Int32 columnOrdinal, TdsParserStateObject stateObj)\n
at Microsoft.Data.SqlClient.TdsParser.TrySkipRow(SqlMetaDataSet columns, Int32 startCol, TdsParserStateObject stateObj)\n
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)\n
at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)\n
at Microsoft.Data.SqlClient.SqlDataReader.ReadAsync(CancellationToken cancellationToken)\n
--- End of stack trace from previous location where exception was thrown ---\n
at Dapper.SqlMapper.QueryAsync[T](IDbConnection cnn, Type effectiveType, CommandDefinition command) in //Dapper/SqlMapper.Async.cs:line 437\n

To reproduce

Sorry, currently can't reproduce in the local environment, so can't provide more detail to reproduce.

Expected behavior

SQL should execute successfully every time.

Further technical details

Microsoft.Data.SqlClient version: 1.1.3 .NET target: Core 3.1 Operating system: Docker container

What I found/tried

https://stackoverflow.com/questions/57270245/sql-server-dbcommand-timeout-with-net-core-container-under-load

DapperLib/Dapper#1435

Were you able to resolve this?

bennil commented 1 year ago

Minimizing Task.Run(...) calls did not solve the problem in my case. Looks like this is a regular threading bug.

Read the other day of a SNI threading bugfix in a ef core preview version (can't find it right now). Hope this will solve this leaking problem soon.

Malgefor commented 1 year ago

Still receiving this error quite often on .NET 7 running in a Linux (Debian) container. Any update on what might be the cause?

svap-roshan commented 1 year ago

@Malgefor Convert all asynchronous SQL calls into synchronous ones, and everything will be sorted :)

jbogard commented 1 year ago

We are also experiencing this issue with Linux App Services against Azure SQL. We only began seeing this after migrating our App Services from Windows to Linux. Also on 5.1.1 of SqlClient.

sliekens commented 1 year ago

I spent a few hours fighting this error. In my case, it was really just a slow transaction that timed out (I use the default CommandTimeout).

My app runs in a dotnet 6.0-bullseye-slim Docker container. I use NServiceBus to handle incoming messages, up to 25 concurrently. When a message arrives, I use EF Core 7.0 to fetch a record from the database using SingleOrDefaultAsync and then save edits to the record using SaveChangesAsync. Sometimes this takes longer than the command timeout. It's a bit unfortunate that timeouts result in an Unknown Error, which I guess is a Linux-only problem. (Why?)

``` Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (258): Unknown error 258 at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error) at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj) at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction2005(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) at Microsoft.Data.SqlClient.SqlInternalTransaction.Commit() at Microsoft.Data.SqlClient.SqlTransaction.Commit() at System.Data.Common.DbTransaction.CommitAsync(CancellationToken cancellationToken) --- End of stack trace from previous location --- at Microsoft.EntityFrameworkCore.Storage.RelationalTransaction.CommitAsync(CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.RelationalTransaction.CommitAsync(CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IList`1 entriesToSave, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(StateManager stateManager, Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken) ClientConnectionId:2762e296-cb69-492f-811e-7f35977086b1 Error Number:-2,State:0,Class:11 ```

The stacktrace doesn't always look the same. Here is another example without ReadSniSyncOverAsync:

``` Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. Operation cancelled by user. ---> System.ComponentModel.Win32Exception (258): Unknown error 258 at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj) at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction2005(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) at Microsoft.Data.SqlClient.SqlInternalTransaction.Commit() at Microsoft.Data.SqlClient.SqlTransaction.Commit() at System.Data.Common.DbTransaction.CommitAsync(CancellationToken cancellationToken) --- End of stack trace from previous location --- at Microsoft.EntityFrameworkCore.Storage.RelationalTransaction.CommitAsync(CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.RelationalTransaction.CommitAsync(CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IList`1 entriesToSave, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(StateManager stateManager, Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken) ClientConnectionId:711c262b-080f-49bb-aa05-12539b982735 Error Number:-2,State:0,Class:11 ```

A side-effect of this Unknown Error seems to be that the connection is NOT (always?) returned to the connection pool. I think this has also been discussed previously as a side-effect of using sync-over-async, but I don't understand the machinery that leads to the connection remaining open but unavailable for reuse.

Connections not being returned to the pool causes my application to come to almost a complete stop, with 100 open connections left unused. I have confirmed on SQL Server that my application has 100 active sessions, yet my message handlers are all crashing with a timeout due to connection pool starvation.

``` System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached. at Microsoft.Data.Common.ADP.ExceptionWithStackTrace(Exception e) --- End of stack trace from previous location --- at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected) at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken) at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync ```

In the end, I just ended up rewriting my queries and adding indexes, so the timeouts disappeared. But it still seems to me that there is something not really working right in the SqlClient on Linux.

co7e commented 11 months ago

This issue has been open for over 3 years and I think that there are probably a lot of people who would be really appreciative of an update on this, even if it is just to say this isn't getting fixed any time soon.

We experienced this issue about a year ago. We ended up shifting our hosting to Windows, at additional cost, to be sure that we wouldn't encounter it again. I came back here a year later expecting (well probably more like hoping) that it would have been resolved, or at least for some sort of update.

IMO, this is a massive issue for a huge number of teams wanting to host on Linux. I'd probably go as far as saying that you shouldn't really be hosting on Linux in a production environment if you are using SqlClient. That's a lot of people and a really big deal. As far as I can tell, any resolutions anyone has above are just workarounds. They are not really fixing the issue. If I'm wrong, please let me know.

Of the above workarounds, is there one that will totally prevent the issue from occurring, please?

Thanks in advance for any feedback.

jperlope commented 11 months ago

I would like to share my experience on this. I used to get exactly the same exception/stack trace when running a query. It was odd because running the execution plan it always showed super optimized, it seemed not a performance problem, but often the exception occurred. We had no idea what to do. But since a time ago, we are rebuilding indexes every week, and exception was not raised anymore. My conclusion is, sometimes SQL server was doing something, or some overload in the SQL server but for any reason refreshing indexes reduces these overloads.

MichelZ commented 11 months ago

For us, increasing the threadpool threads eliminated this issue. We're still on linux and are not seeing this (anymore)

bailsman commented 11 months ago

We also started seeing unknown error 258 after migrating from Windows to Linux App Services, but strangely only on one of our environments.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__209_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.<PopulateSplitIncludeCollectionAsync>g__InitializeReaderAsync|27_0[TIncludingEntity,TIncludedEntity](RelationalQueryContext queryContext, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.PopulateSplitIncludeCollectionAsync[TIncludingEntity,TIncludedEntity](Int32 collectionId, RelationalQueryContext queryContext, IExecutionStrategy executionStrategy, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, SplitQueryResultCoordinator resultCoordinator, Func`3 childIdentifier, IReadOnlyList`1 identifierValueComparers, Func`5 innerShaper, Func`4 relatedDataLoaders, INavigationBase inverseNavigation, Action`2 fixup, Boolean trackingQuery)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.TaskAwaiter(Func`1[] taskFactories)
   at Microsoft.EntityFrameworkCore.Query.Internal.SplitQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

@MichelZ Out of curiosity, what did you increase the minimum threads to?

MichelZ commented 11 months ago

We also started seeing unknown error 258 after migrating from Windows to Linux App Services, but strangely only on one of our environments.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__209_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.<PopulateSplitIncludeCollectionAsync>g__InitializeReaderAsync|27_0[TIncludingEntity,TIncludedEntity](RelationalQueryContext queryContext, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.PopulateSplitIncludeCollectionAsync[TIncludingEntity,TIncludedEntity](Int32 collectionId, RelationalQueryContext queryContext, IExecutionStrategy executionStrategy, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, SplitQueryResultCoordinator resultCoordinator, Func`3 childIdentifier, IReadOnlyList`1 identifierValueComparers, Func`5 innerShaper, Func`4 relatedDataLoaders, INavigationBase inverseNavigation, Action`2 fixup, Boolean trackingQuery)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.TaskAwaiter(Func`1[] taskFactories)
   at Microsoft.EntityFrameworkCore.Query.Internal.SplitQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

@MichelZ Out of curiosity, what did you increase the minimum threads to?

We went aggressive/overkill, but we haven't seen negative side effects currently.

  ThreadPool.GetMaxThreads(out int workerThreads, out int completionPortThreads);
  ThreadPool.SetMaxThreads(Math.Max(workerThreads, 2048), Math.Max(completionPortThreads, 2048));
  ThreadPool.SetMinThreads(1024, 512);
co7e commented 11 months ago

In my last post I guessed the 258 error occurs because of full tcp buffers and wrong tcp socket handling on linux (see my previous post above). Further investigation made me more confident that this is the case.

The additional queue I added (last post) helped a bit, but after adding more features (service implementations) to our host process the error came back quite frequently.

Because most 258 errors occured in DbContext.SaveChanges() I synchronized calls in my DbContext base implementation like this:

public class MyBaseDbContext : DbContext
{
    private static readonly object syncObj = new object();

    public override int SaveChanges()
    {
        lock (syncObj)
        {
            return base.SaveChanges();
        }
    }
}

This dirty code slowed down my execution flow and helped reducing the 258 errors. But like others reported here about disabling MARS it did not solve the problem completly.

I examined that the problem occurs often during extensive external web API rest calls. So I synchronized them as well:

static SemaphoreSlim semaphoreSlimGeocodeLock = new SemaphoreSlim(1, 1);

public void MyRestCallMethod() 
{
    try
    {
        await semaphoreSlimGeocodeLock.WaitAsync();

        HttpClient httpClient = new HttpClient();
        string result = await httpClient.GetStringAsync(url);
        ...

    }
    finally
    {
        httpClient.Dispose();
        semaphoreSlimGeocodeLock.Release();
    }
}

No parallel web calls helped a lot and made me sure the error is related to the current tcp buffer usage within the application.

I guess the error originates in some way like this:

  1. SqlClient reads/writes to the sql server tcp socket connection.
  2. Because of the buffer handling bug the read/write loop returns indicating the remotehost has terminated the connection.
  3. SqlClient internal reconnect feature reconnects immediately to the SQL Server.
  4. On the next tcp read/write the same error occures because the application tcp buffer is still full.
  5. This is repeated many times and there might be a threading issue playing a role as well.
  6. The application is freezed and timeout "Unknwon error 258" occurs

Slowing down the execution flow and hoping not reaching tcp buffer limits is not a production ready workaround :) I wonder if this is a common problem for all high tcp i/o apps running on Linux and connecting to Sql Server?

This error occured some time after upgrading ef core (not sure which .net core/ef core version). My application runs currently on:

  • Bare metal server: AMD Ryzen 7 2700X
  • Ubuntu 18.04.2 LTS
  • .NET 5.0
  • Microsoft.EntityFrameworkCore.SqlServer 5.0.1

Used SQL Server: Latest SQL Server 2019 Docker container

Side notes:

  • Upgrading from a two years old SQL Server 2017 Container to the latest 2019 version did not change anything.
  • I did try collecting traces with dotnet-trace collect --process-id 27455 --profile database but the results did not include the expected events.

Thought I'd add that when we investigated this a year ago, we felt that analysis from @bennil, above, married up with what we were seeing.

Not sure I'm comfortable with your solution @MichelZ even though you haven't suffered any negative consequences. Great it is working for you though.

bennil commented 11 months ago

Thanks @co7e for bringing this matter up again.

My insight after watching/analyzing this for years are the following:

SqlClient is not thread safe on Linux (saw it on Intel and AMD). Reducing parallel work or adjust thread execution might help but not solve the problem. Sooner or later the exception will hit you.

Surprisingly Microsoft doesn't care for years even though a huge number of enterprise software is running on this db access lottery.

The bug is hard to reproduce as all threading issues are. I have some ideas how to construct testcode for further investigations, but this is not a charity project...

dazinator commented 11 months ago

We also started seeing unknown error 258 after migrating from Windows to Linux App Services, but strangely only on one of our environments.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__209_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.<PopulateSplitIncludeCollectionAsync>g__InitializeReaderAsync|27_0[TIncludingEntity,TIncludedEntity](RelationalQueryContext queryContext, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.PopulateSplitIncludeCollectionAsync[TIncludingEntity,TIncludedEntity](Int32 collectionId, RelationalQueryContext queryContext, IExecutionStrategy executionStrategy, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, SplitQueryResultCoordinator resultCoordinator, Func`3 childIdentifier, IReadOnlyList`1 identifierValueComparers, Func`5 innerShaper, Func`4 relatedDataLoaders, INavigationBase inverseNavigation, Action`2 fixup, Boolean trackingQuery)
   at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.TaskAwaiter(Func`1[] taskFactories)
   at Microsoft.EntityFrameworkCore.Query.Internal.SplitQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

@MichelZ Out of curiosity, what did you increase the minimum threads to?

Your stack trace doesn't contain Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() - be aware that error 258 is a fairly broad error, see my comment https://github.com/dotnet/SqlClient/issues/647#issuecomment-1348177080

bailsman commented 10 months ago

In our case the problem has now as mysteriously disappeared as it has mysteriously appeared. The only change we made was migrate to a different virtual machine than the previous SQL Server was running on.

By taking some packet captures by logging in with ssh on the Linux Web App, and typing ip a to get the interface name and then tcpdump -i vnet0g6qkt5hl -w ~/network_traces/0001.pcap we noticed something strange while the problem was still occurring.

A packet containing a SQL query that in the tcpdump was registered with size 2130 bytes and data length 2064 was being "partially acked". After sending the large packet out, the ACK came back only for the first 1398 bytes, causing a retransmission of the last 666 bytes. However, in response to this retransmission only came another ACK for only the first 1398 bytes, causing another retransmission of the last 666 bytes, and so on and so forth, until eventually the connection timed out. Rather than a threading issue this potentially seems like some kind of network layer problem. However, we can no longer reproduce it, so it's difficult to investigate more deeply.

As @dazinator already suggested our particular problem was probably something entirely different than what this ticket is about. The 258 timeout error is fairly general - it just means that some kind of wait operation timed out and it can have multiple different underlying causes. Sorry for the noise.

jaq316 commented 8 months ago

In our case the problem has now as mysteriously disappeared as it has mysteriously appeared. The only change we made was migrate to a different virtual machine than the previous SQL Server was running on.

By taking some packet captures by logging in with ssh on the Linux Web App, and typing ip a to get the interface name and then tcpdump -i vnet0g6qkt5hl -w ~/network_traces/0001.pcap we noticed something strange while the problem was still occurring.

A packet containing a SQL query that in the tcpdump was registered with size 2130 bytes and data length 2064 was being "partially acked". After sending the large packet out, the ACK came back only for the first 1398 bytes, causing a retransmission of the last 666 bytes. However, in response to this retransmission only came another ACK for only the first 1398 bytes, causing another retransmission of the last 666 bytes, and so on and so forth, until eventually the connection timed out. Rather than a threading issue this potentially seems like some kind of network layer problem. However, we can no longer reproduce it, so it's difficult to investigate more deeply.

As @dazinator already suggested our particular problem was probably something entirely different than what this ticket is about. The 258 timeout error is fairly general - it just means that some kind of wait operation timed out and it can have multiple different underlying causes. Sorry for the noise.

We have also experienced the same issue.

The solution, for us, was to limit the packet size in the Connection string by appending ;Packet Size=512. We chose 512 as it is the minimum size.

See https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.packetsize?view=dotnet-plat-ext-8.0 for more info on SqlConnection.PacketSize property.

Vandersteen commented 8 months ago

Could this be related ? https://medium.com/swlh/fix-a-random-network-connection-reset-issue-in-docker-kubernetes-5c57a11de170

bennil commented 8 months ago

https://www.codeproject.com/Tips/5378079/Challenges-in-Migrating-ASP-NET-Apps-to-Containers

Maybe after all it relates to MARS? (DB connection string: MultipleActiveResultSets=True)

dazinator commented 8 months ago

https://www.codeproject.com/Tips/5378079/Challenges-in-Migrating-ASP-NET-Apps-to-Containers

Maybe after all it relates to MARS? (DB connection string: MultipleActiveResultSets=True)

Not in mine or many other cases. The MARS issue was seperate as far as I know. I use MultipleActiveResultSets=False and still saw this problem.

Note: After moving the work to a seperate dedicated process (and therefore removing contention of the application thread pool in the process) I no longer see this issue. The new process starts - on the same machine, runs the same queries, and terminates as expected. Also, not only are we running this logic in a dedicated process, we have 10 jobs to run, rather than them all being run in application at same time as before, they are now 10 seperate dedicated processes that start then terminate. So this has resulted in 11 seperate processes, each with own thread pool and using own connection.

So it seems to be to do with:-

  1. A ramp up in load - placed on either a connection or the thread pool
dazinator commented 8 months ago

I also wondered if Azure SQL has some sort of connection "rejection" mechanism, such that when it scales up or CPU usage is high, it can reject queries with a 258 "immediately" and that perhaps when a connection is in this state, it doesn't clear. A bit like a rate limiting feature. However this is pure speculation, I don't have the networking skills to trace what packets is being sent when this occurs, as our apps run in docker, on linux azure VM's and things get pretty complicated pretty fast.