Closed: frankyuan closed this 5 months ago
As a workaround, you can bump up the server's max worker threads. That will allow other pending tasks to run and let connections that are due to be closed actually close.
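For reference, that server-side setting can be raised with sp_configure. A minimal sketch of doing it from code, assuming sysadmin-level permissions (the value 1024 is purely illustrative; 0 lets SQL Server size the worker pool automatically):
using Microsoft.Data.SqlClient;

class MaxWorkerThreadsSample
{
    // One-off admin helper: 'max worker threads' is an advanced option,
    // so 'show advanced options' is enabled first.
    static void RaiseMaxWorkerThreads(string connectionString)
    {
        const string sql = @"
            EXEC sp_configure 'show advanced options', 1;
            RECONFIGURE;
            EXEC sp_configure 'max worker threads', 1024;
            RECONFIGURE;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}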
@cheenamalhotra what if you are running inside Azure SQL Pool and the Workers count is not hitting the limit?
I managed to reproduce the issue with MARS enabled (another issue), but the Unknown error 258 and some others are happening to our application when running in production too. In the same scenario, exceptions start to happen when the system is under heavy load.
We need a fix for that ASAP as we are running financial services.
@arielmoraes are you able to provide a sample repro? MARS enabled is probably issue #422, which is a different story. Alternatively, are you able to capture EventSource traces that at least show us where the issue happened?
@JRahnama I managed to set up a sample repro to capture the events when using MARS; I will post the results on the other issue.
For this issue, I'll have to enable the event source in production, which will generate A LOT of events. If I manage to create a sample repro I'll post the results here.
In my last post I guessed the 258 error occurs because of full tcp buffers and wrong tcp socket handling on linux (see my previous post above). Further investigation made me more confident that this is the case.
The additional queue I added (last post) helped a bit, but after adding more features (service implementations) to our host process the error came back quite frequently.
Because most 258 errors occurred in DbContext.SaveChanges(), I synchronized calls in my DbContext base implementation like this:
public class MyBaseDbContext : DbContext
{
    private static readonly object syncObj = new object();

    public override int SaveChanges()
    {
        lock (syncObj)
        {
            return base.SaveChanges();
        }
    }
}
This dirty code slowed down my execution flow and helped reduce the 258 errors. But, like others reported here about disabling MARS, it did not solve the problem completely.
I observed that the problem often occurs during extensive external REST API calls, so I synchronized them as well:
static SemaphoreSlim semaphoreSlimGeocodeLock = new SemaphoreSlim(1, 1);

public async Task MyRestCallMethod()
{
    HttpClient httpClient = null;
    try
    {
        await semaphoreSlimGeocodeLock.WaitAsync();
        httpClient = new HttpClient();
        string result = await httpClient.GetStringAsync(url);
        ...
    }
    finally
    {
        httpClient?.Dispose();
        semaphoreSlimGeocodeLock.Release();
    }
}
Eliminating parallel web calls helped a lot and made me confident the error is related to the current TCP buffer usage within the application.
I guess the error originates in some way like this:
- SqlClient reads/writes to the SQL Server TCP socket connection.
- Because of the buffer handling bug the read/write loop returns, indicating the remote host has terminated the connection.
- SqlClient's internal reconnect feature reconnects immediately to the SQL Server.
- On the next TCP read/write the same error occurs because the application TCP buffer is still full.
- This is repeated many times, and there might be a threading issue playing a role as well.
- The application freezes and the timeout "Unknown error 258" occurs.
Slowing down the execution flow and hoping not to reach TCP buffer limits is not a production-ready workaround :) I wonder if this is a common problem for all high TCP I/O apps running on Linux and connecting to SQL Server?
This error occurred some time after upgrading EF Core (not sure which .NET Core/EF Core version). My application currently runs on:
- Bare metal server: AMD Ryzen 7 2700X
- Ubuntu 18.04.2 LTS
- .NET 5.0
- Microsoft.EntityFrameworkCore.SqlServer 5.0.1
Used SQL Server: Latest SQL Server 2019 Docker container
Side notes:
- Upgrading from a two-year-old SQL Server 2017 container to the latest 2019 version did not change anything.
- I did try collecting traces with
dotnet-trace collect --process-id 27455 --profile database
but the results did not include the expected events.
@bennil and those who are using .NET Core, if possible, can you test with MultiSubnetFailover=true?
I know it could be a very long shot, but reading this I can't stop thinking: could this be related to continuations trying to execute in the wrong SynchronizationContext?
@JRahnama we added MultiSubnetFailover=true to our connection string. I don't know if we can reproduce the problem reliably, because our last change further minimized parallel work and the error then occurred only once during the day. So we need a few days to judge whether this option changed anything.
@arielmoraes: Interesting thought. I did search our codebase and found some improper use of Task.Result. Can this affect the synchronous DB save method? Or did we just reach a parallel limit using MARS and too many DB read/write worker tasks?
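For anyone following along, "improper use of Task.Result" refers to sync-over-async blocking. A purely illustrative sketch (the names are made up, not from any codebase discussed here):
using System.Threading.Tasks;

class SyncOverAsyncExample
{
    // Stand-in for an async EF Core / SqlClient call.
    static Task<int> LoadCountAsync() => Task.FromResult(42);

    // Anti-pattern: blocking on .Result ties up a thread-pool thread while the
    // awaited work also needs a thread-pool thread to finish - under load this
    // starves the pool and surfaces as timeouts.
    static int LoadCountBlocking() => LoadCountAsync().Result;

    // Preferred shape: stay asynchronous end to end.
    static async Task<int> LoadCountProperAsync() => await LoadCountAsync();
}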
Recently we observed a large number of Unknown error 258 during peak business hours. On the database server we saw 100% CPU, with sql server windows nt - 64bit using 98%+. There were no deadlocks. Using select * from master.dbo.sysprocesses we saw around 300 connections, of which about 170 belonged to our system. Our current connection string: Data Source=xxxxxxx,15726;Initial Catalog=xxxxx;Persist Security Info=True;User ID=xxxx;Password=xxxxxx;Integrated Security=false;Connect Timeout=30;MultipleActiveResultSets=false
Two high-concurrency services connect to this database. Some SQL statements were slow; after optimizing them, running select * from master.dbo.sysprocesses again today shows around 60 connections, and there are no more Unknown error 258 records.
Also: the database's maximum number of connections is set to 0 (unlimited).
Without wanting to reveal too much about our infrastructure setup in public, we see this error pretty consistently when load testing a web API that inserts a large volume of data into an Azure SQL database.
Details that might hopefully enable a repro:
At the time of the error, for example, below at 3:18pm, the CPU and IO usage of the SQL serverless resource are both between 70-80%, and the CPU usage on the node running the Docker container is 100%.
So I am assuming @bennil may be on to something, given this is happening only in high-load, high-concurrency scenarios for us, where CPU usage is pushed to the limit.
A few days ago we've found the solution for our case. If you are using Azure Load Balancer + SQL Database (or SQL Pool) in a VNET, check if your configuration for outbound connections is set to use the Azure Private Network, otherwise, you could be running into SNAT Port Exhaustion.
To check if that's the case, go to the Load Balancer resource page, Metrics, and show SNAT Connection Count, filter for failed Statuses.
Another solution is to use more Outbound IPs.
(Quoting @bennil's earlier comment about TCP buffer handling on Linux and the synchronized SaveChanges()/HttpClient workarounds.)
My suggestion is to create a singleton service serving only one instance of HttpClient.
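Something along these lines, as a minimal sketch (in ASP.NET Core, IHttpClientFactory via services.AddHttpClient() is the more common route; the class and method names here are illustrative):
using System.Net.Http;
using System.Threading.Tasks;

// One shared HttpClient for the process instead of new HttpClient() per call,
// which avoids socket churn under load.
static class GeocodeHttpClient
{
    private static readonly HttpClient Client = new HttpClient();

    public static Task<string> GetStringAsync(string url) => Client.GetStringAsync(url);
}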
258 timeout is a common exception when the DTU limit is reached on Azure SQL. If on Azure, you can try to monitor the Max DTU percentage and see if it hits the limit.
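If you prefer checking from code rather than the portal, a rough sketch using the sys.dm_db_resource_stats DMV (which keeps roughly the last hour of usage for an Azure SQL database; DTU usage is effectively the highest of the CPU, data IO and log write percentages):
using System;
using Microsoft.Data.SqlClient;

class DtuUsageSample
{
    static void PrintRecentResourceUsage(string connectionString)
    {
        const string sql = @"
            SELECT TOP (10) end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
            FROM sys.dm_db_resource_stats
            ORDER BY end_time DESC;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(
                        $"{reader.GetDateTime(0):u} cpu={reader.GetDecimal(1)}% io={reader.GetDecimal(2)}% log={reader.GetDecimal(3)}%");
                }
            }
        }
    }
}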
(Quoting @arielmoraes's earlier comment about SNAT port exhaustion behind Azure Load Balancer.)
We were originally connecting from our Azure VNet to the Azure SQL server over a public connection, so we looked into what you said about SNAT. We didn't have a standard load balancer or a NAT gateway on our VNet, only an Azure gateway for ingress.
So this week I did some things.
I believe this rules out SNAT Port Exhaustion as the cause of our issue.
Also, we only see this when multiple threads are each trying to save a large number of records (5000) to the same tables. If we reduce the volume of records being inserted per concurrent thread, and the number of threads, we don't see the error.
DTU does not appear to be near the limit. We have a 'serverless' configuration and CPU does spike as expected under high load.
I guess this could just be a genuine timeout due to SQL Server not being able to insert these records into the same table. Perhaps there is locking or some such. I'm moving down the path that this is just a genuine timeout due to the server being busy under a high-load scenario. In this case I am going to try just increasing the command timeout to see if that makes a difference. I will also try to diagnose the inserts on the SQL Server side to see if it is hitting some bottleneck there.
Just an update to say that increasing the SQL command timeout "solved" this issue. Not sure why the SQL commands are so slow under this load test; they insert records into a table. I'm assuming it's related to Azure SQL DB performance but not sure where to check. Leaving that to one side for now.
I'm intermittently seeing the same issue with a self-hosted SQL Server Instance.
Microsoft SQL Server 2016 (SP1) (KB3182545) - 13.0.4001.0 (X64) Oct 28 2016 18:17:30 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows Server 2016 Standard 6.3 <X64> (Build 14393: ) (Hypervisor)
The client side uses .NET 4.8 with Entity Framework Core 3.1.5 (Code First) on Windows 10 Enterprise 21H2 19044.1766.
My code is extremely simple:
private void Insert(IList<T> data)
{
using (MyDbContext dbContext = MyDbContextFactory.CreateDbContext())
{
dbContext.AddRange(data);
dbContext.SaveChanges();
}
}
Even when I try to insert only 50 rows (with some child rows in another table), I see the error mentioned here:
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlInternalConnection.cs:line 820
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 1572
at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1070
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 2744
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 2258
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 2182
at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1222
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1450
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 2190
at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 2088
at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 9384
at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlInternalConnectionTds.cs:line 1311
at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlInternalConnectionTds.cs:line 1091
at Microsoft.Data.SqlClient.SqlInternalTransaction.Commit() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\sqlinternaltransaction.cs:line 344
at Microsoft.Data.SqlClient.SqlTransaction.Commit() in E:\SqlClientInternal\agent-1\_work\5\s\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlTransaction.cs:line 202
at Microsoft.EntityFrameworkCore.Storage.RelationalTransaction.Commit()
at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.Execute(IEnumerable`1 commandBatches, IRelationalConnection connection)
at Microsoft.EntityFrameworkCore.Storage.RelationalDatabase.SaveChanges(IList`1 entries)
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChanges(IList`1 entriesToSave)
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChanges(DbContext _, Boolean acceptAllChangesOnSuccess)
at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChanges(Boolean acceptAllChangesOnSuccess)
at Microsoft.EntityFrameworkCore.DbContext.SaveChanges(Boolean acceptAllChangesOnSuccess)
at Microsoft.EntityFrameworkCore.DbContext.SaveChanges()
It has the same NativeErrorCode=258.
I have increased the timeout to as much as 1200 seconds (20 minutes), and I still get timeouts - much earlier than the timeout I configured (after 90 seconds). Here's how the DbContext is created:
static class MyDbContextFactory
{
public static MyDbContext CreateDbContext()
{
DbContextOptions<MyDbContext> options = new DbContextOptionsBuilder<MyDbContext>()
.UseSqlServer(GlobalConfiguration.GetConnectionString(), ConfigureSqlServerOptions)
.Options;
return new MyDbContext(options);
}
private static void ConfigureSqlServerOptions(SqlServerDbContextOptionsBuilder opts)
{
opts.EnableRetryOnFailure();
int? dbCommandTimeout = GlobalConfiguration.DbCommandTimeout;
if (dbCommandTimeout.HasValue)
{
opts.CommandTimeout(dbCommandTimeout.Value);
}
}
}
And the context itself is very simple, too:
sealed class MyDbContext : DbContext
{
public MyDbContext(DbContextOptions<MyDbContext> options) : base(options)
{
}
// [...]
}
Any idea what could be causing this?
@MarvinNorway which Microsoft.Data.SqlClient version - and have you tried 4.1?
@ErikEJ I see that both Microsoft.Data.SqlClient and Microsoft.Data.SqlClient.SNI are 1.1.1. Seems like it's ancient. I'll try with a newer version...
@ErikEJ Upgrading Microsoft.Data.SqlClient to 4.1.0 and Microsoft.Data.SqlClient.SNI to 4.0.0 solved the problem... Thanks. They were buried among a couple other packages that I can't upgrade, that's why I didn't see that they were so outdated...
@MarvinNorway we're using 4.1.0 of SqlClient and Functions V4, deployed to a Linux App Service, and still get this error - what kind of app is yours?
Microsoft.Data.SqlClient.SqlException:
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParser.TryRun (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParser.Run (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.RunParser (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsyncContinuedOnSuccess (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsyncContinued (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServerInternalRestContinuedAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServerInternalRestAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServerInternalAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.WriteRowSourceToServerAsync (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlBulkCopy.WriteToServer (Microsoft.Data.SqlClient, Version=4.1.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
I am having this issue intermittently with a .NET 6.0 app running in containers on Azure Kubernetes Service accessing an on-prem SQL Server.
I am using Microsoft.Data.SqlClient.dll v5.0.0.
The Microsoft.Data.SqlClient.SNI.dll is only in the directory runtimes\win-*\native, so I assume it isn't going to get loaded.
I have checked the SNAT Connection Count where the State is Failed and there are none.
We are genuinely considering either moving the hosting off Linux hosts or, more likely, swapping to a different DB tech, as this is such a problem.
Experienced a flavor of this issue when inserting single records at intervals of a few seconds from concurrently running applications, each application operating on its own instance of DbContext with no concurrent use inside its scope.
Packages: Microsoft.Data.SqlClient 5.0.1, Microsoft.EntityFrameworkCore.SqlServer 6.0.10
Exception
An exception occurred in the database while saving changes for context type '"XXX.XXXX.XXXX.XXXContext"'.
Microsoft.EntityFrameworkCore.DbUpdateException: An error occurred while saving the entity changes. See the inner exception for details.
 ---> Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__207_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.ReaderModificationCommandBatch.ExecuteAsync(IRelationalConnection connection, CancellationToken cancellationToken)
ClientConnectionId:554eb334-af3c-400d-92df-1c1c1dd92d2b
Error Number:-2,State:0,Class:11
 --- End of inner exception stack trace ---
   at Microsoft.EntityFrameworkCore.Update.ReaderModificationCommandBatch.ExecuteAsync(IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IList`1 entriesToSave, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(StateManager stateManager, Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass33_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
Client .NET 6.0 application implemented as BackgroundService.
Error occurred on
_XContext.X.Add(x);
await _XContext.SaveChangesAsync(cancellationToken);
Last EF-generated SQL command before this exception was thrown:
SET NOCOUNT ON;
INSERT INTO [X] ([A], [B], [C], [D], [E], [F], [G])
VALUES (@p0, @p1, @p2, @p3, @p4, @p5, @p6);
SELECT [Z], [K] FROM [X]
WHERE @@ROWCOUNT = 1 AND [Z] = scope_identity()
Database SQL Server 2016
Resolution for now: optimizing the expensive read queries that affected the table as much as possible. After optimizing the heavier read queries, I haven't seen this error.
I had several applications concurrently using the database for one expensive read query each on startup, and then making inserts on the same tables (not the same records). Each had its own scoped instance of DbContext. The issue seemed to come when inserts were occurring on a table that was under a heavy read query by another process. A small percentage of inserts resulted in this error. Enabling retries didn't seem to help.
This sounds like a thread starvation issue with the system thread pool. This is a known issue when running asynchronously with Microsoft.Data.SqlClient on .NET Core, and it's in our backlog to make it fully asynchronous, which will require a big refactor. You might be able to work around this by calling ThreadPool.SetMinThreads with a high enough value to accommodate, but with many parallel tasks there's more overhead needed to allow the continuation tasks inside SqlClient to run before the timeouts are hit.
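For anyone trying that workaround, a minimal sketch placed at application startup (the floor values are illustrative and should be sized to your own workload):
using System;
using System.Threading;

// Raise the thread pool floor so bursts of blocked SqlClient work don't have to
// wait for the pool's slow thread-injection rate.
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
ThreadPool.SetMinThreads(Math.Max(minWorker, 200), Math.Max(minIo, 200));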
@lcheunglci - do you know of a good way to monitor the thread pool to determine if this is indeed thread starvation causing the issue?
Due to number 3) we would be surprised if the thread pool was starved, as from our server logs not much was happening apart from one user using a grid at this time. We'd like to be able to verify that though.
@lcheunglci Apologies, the error I am seeing is a 258, but it's not the "ReadSniSyncOverAsync" variety. It's actually this one here: https://github.com/dotnet/SqlClient/issues/1530 will move my comments there
When will this major, years-old Linux issue be fixed? Using the MS SQL Docker image on Linux feels like MySQL in the nineties.
My platform:
Infrastructure: Docker on Red Hat Enterprise Linux
Language & Framework: .NET 6 with Entity Framework Core
Database: SQL Server
After optimizing my API and implementing 'async' and 'await' with 'Task', all the errors have disappeared. I also set maximum and minimum pooling and the connection lifetime.
Could you clarify what you mean here by "implementing 'async' and 'await' with 'Task'"? Was your API synchronous when that error appeared?
I just made my API asynchronous. Yes, it was synchronous before.
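For reference, the pooling knobs mentioned above are ordinary connection string settings; a sketch with illustrative values (the server and database names are placeholders, and 'Connection Lifetime' is exposed as LoadBalanceTimeout on the builder):
using Microsoft.Data.SqlClient;

var builder = new SqlConnectionStringBuilder
{
    DataSource = "tcp:myserver.database.windows.net,1433", // placeholder
    InitialCatalog = "MyDatabase",                         // placeholder
    MinPoolSize = 10,
    MaxPoolSize = 200,
    LoadBalanceTimeout = 300, // 'Connection Lifetime', in seconds
    MultipleActiveResultSets = false
};
string connectionString = builder.ConnectionString;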
I had a very similar problem recently:
In my case I have a dockerized web API where I start a new TransactionScope whenever a PUT/POST/PATCH/DELETE call is received.
Depending on the operation, one or more connections are opened to different Azure SQL databases via EF Core (and ultimately SqlClient) - queries on the first connection will work, but any queries on other connections won't, as I'm not using the Azure DTC preview.
Not creating a TransactionScope solved my immediate issue in these instances.
The exception was the same as @frankyuan's error and appeared intermittently in logs, depending on application usage and behaviour:
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): Unknown error 258
...which was not a super helpful error message while trying to figure this out 👎
Describe the bug
When executing SQL such as
SELECT FieldA, FieldB FROM A INNER JOIN C ON A.FieldId = C.FieldId UNION SELECT FieldA, FieldD FROM A INNER JOIN D ON A.FieldId = D.FieldId
, the error below is thrown - not every time, just a small portion of queries have this issue.
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByteArray(Span`1 buff, Int32 len, Int32& totalRead)
   at Microsoft.Data.SqlClient.TdsParser.TrySkipValue(SqlMetaDataPriv md, Int32 columnOrdinal, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TrySkipRow(SqlMetaDataSet columns, Int32 startCol, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
   at Microsoft.Data.SqlClient.SqlDataReader.ReadAsync(CancellationToken cancellationToken)
--- End of stack trace from previous location where exception was thrown ---
   at Dapper.SqlMapper.QueryAsync[T](IDbConnection cnn, Type effectiveType, CommandDefinition command) in /_/Dapper/SqlMapper.Async.cs:line 437
To reproduce
Sorry, currently can't reproduce in the local environment, so can't provide more detail to reproduce.
Expected behavior
SQL should execute successfully every time.
Further technical details
Microsoft.Data.SqlClient version: 1.1.3
.NET target: Core 3.1
Operating system: Docker container
What I found/tried
Were you able to resolve this?
Minimizing Task.Run(...) calls did not solve the problem in my case. Looks like this is a regular threading bug.
Read the other day about an SNI threading bugfix in an EF Core preview version (can't find it right now). Hope this will solve this leaking problem soon.
Still receiving this error quite often on .NET 7 running in a Linux (Debian) container. Any update on what might be the cause?
@Malgefor Convert all asynchronous SQL calls into synchronous ones, and everything will be sorted :)
We are also experiencing this issue with Linux App Services against Azure SQL. We only began seeing this after migrating our App Services from Windows to Linux. Also on 5.1.1 of SqlClient.
I spent a few hours fighting this error. In my case, it was really just a slow transaction that timed out (I use the default CommandTimeout).
My app runs in a dotnet 6.0-bullseye-slim Docker container. I use NServiceBus to handle incoming messages, up to 25 concurrently. When a message arrives, I use EF Core 7.0 to fetch a record from the database using SingleOrDefaultAsync and then save edits to the record using SaveChangesAsync. Sometimes this takes longer than the command timeout. It's a bit unfortunate that timeouts result in an Unknown Error, which I guess is a Linux-only problem. (Why?)
The stack trace doesn't always look the same. Here is another example without ReadSniSyncOverAsync:
A side-effect of this Unknown Error seems to be that the connection is NOT (always?) returned to the connection pool. I think this has also been discussed previously as a side-effect of using sync-over-async, but I don't understand the machinery that leads to the connection remaining open but unavailable for reuse.
Connections not being returned to the pool causes my application to come to almost a complete stop, with 100 open connections left unused. I have confirmed on SQL Server that my application has 100 active sessions, yet my message handlers are all crashing with a timeout due to connection pool starvation.
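If it helps anyone confirm the same picture, a sketch of counting sessions per client application via sys.dm_exec_sessions (assumes VIEW SERVER STATE permission; purely illustrative):
using System;
using Microsoft.Data.SqlClient;

class SessionCountSample
{
    static void PrintSessionsPerApplication(string connectionString)
    {
        const string sql = @"
            SELECT COALESCE(program_name, N'(unknown)') AS program_name, COUNT(*) AS session_count
            FROM sys.dm_exec_sessions
            WHERE is_user_process = 1
            GROUP BY program_name
            ORDER BY session_count DESC;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt32(1)} sessions");
                }
            }
        }
    }
}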
In the end, I just ended up rewriting my queries and adding indexes, so the timeouts disappeared. But it still seems to me that there is something not really working right in the SqlClient on Linux.
This issue has been open for over 3 years and I think that there are probably a lot of people who would be really appreciative of an update on this, even if it is just to say this isn't getting fixed any time soon.
We experienced this issue about a year ago. We ended up shifting our hosting to Windows, at additional cost, to be sure that we wouldn't encounter it again. I came back here a year later expecting (well probably more like hoping) that it would have been resolved, or at least for some sort of update.
IMO, this is a massive issue for a huge number of teams wanting to host on Linux. I'd probably go as far as saying that you shouldn't really be hosting on Linux in a production environment if you are using SqlClient. That's a lot of people and a really big deal. As far as I can tell, any resolutions anyone has above are just workarounds. They are not really fixing the issue. If I'm wrong, please let me know.
Of the above workarounds, is there one that will totally prevent the issue from occurring, please?
Thanks in advance for any feedback.
I would like to share my experience with this. I used to get exactly the same exception/stack trace when running a query. It was odd because the execution plan always looked well optimized, so it didn't seem to be a performance problem, yet the exception occurred often. We had no idea what to do. But some time ago we started rebuilding indexes every week, and the exception has not been raised since. My conclusion is that sometimes SQL Server was overloaded by something, and for whatever reason rebuilding the indexes reduces that overload.
For us, increasing the thread pool threads eliminated this issue. We're still on Linux and are not seeing this (anymore).
We also started seeing unknown error 258 after migrating from Windows to Linux App Services, but strangely only on one of our environments.
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__209_0(Task`1 result)
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.<PopulateSplitIncludeCollectionAsync>g__InitializeReaderAsync|27_0[TIncludingEntity,TIncludedEntity](RelationalQueryContext queryContext, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.PopulateSplitIncludeCollectionAsync[TIncludingEntity,TIncludedEntity](Int32 collectionId, RelationalQueryContext queryContext, IExecutionStrategy executionStrategy, RelationalCommandCache relationalCommandCache, IReadOnlyList`1 readerColumns, Boolean detailedErrorsEnabled, SplitQueryResultCoordinator resultCoordinator, Func`3 childIdentifier, IReadOnlyList`1 identifierValueComparers, Func`5 innerShaper, Func`4 relatedDataLoaders, INavigationBase inverseNavigation, Action`2 fixup, Boolean trackingQuery)
at Microsoft.EntityFrameworkCore.Query.RelationalShapedQueryCompilingExpressionVisitor.ShaperProcessingExpressionVisitor.TaskAwaiter(Func`1[] taskFactories)
at Microsoft.EntityFrameworkCore.Query.Internal.SplitQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
@MichelZ Out of curiosity, what did you increase the minimum threads to?
We went aggressive/overkill, but we haven't seen negative side effects currently.
ThreadPool.GetMaxThreads(out int workerThreads, out int completionPortThreads);
ThreadPool.SetMaxThreads(Math.Max(workerThreads, 2048), Math.Max(completionPortThreads, 2048));
ThreadPool.SetMinThreads(1024, 512);
Thought I'd add that when we investigated this a year ago, we felt that analysis from @bennil, above, married up with what we were seeing.
Not sure I'm comfortable with your solution @MichelZ even though you haven't suffered any negative consequences. Great it is working for you though.
Thanks @co7e for bringing this matter up again.
My insights after watching/analyzing this for years are the following:
SqlClient is not thread safe on Linux (saw it on Intel and AMD). Reducing parallel work or adjusting thread execution might help but will not solve the problem. Sooner or later the exception will hit you.
Surprisingly, Microsoft hasn't cared for years, even though a huge amount of enterprise software is running on this DB access lottery.
The bug is hard to reproduce, as all threading issues are. I have some ideas for how to construct test code for further investigation, but this is not a charity project...
(Quoting the earlier comment and stack trace about unknown error 258 after migrating from Windows to Linux App Services.)
Your stack trace doesn't contain Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() - be aware that error 258 is a fairly broad error, see my comment https://github.com/dotnet/SqlClient/issues/647#issuecomment-1348177080
In our case the problem has now disappeared as mysteriously as it appeared. The only change we made was to migrate to a different virtual machine than the one the previous SQL Server was running on.
By taking some packet captures (logging in with SSH on the Linux Web App, typing ip a to get the interface name, and then running tcpdump -i vnet0g6qkt5hl -w ~/network_traces/0001.pcap), we noticed something strange while the problem was still occurring.
A packet containing a SQL query that in the tcpdump was registered with size 2130 bytes and data length 2064 was being "partially acked". After sending the large packet out, the ACK came back only for the first 1398 bytes, causing a retransmission of the last 666 bytes. However, the response to this retransmission was again an ACK for only the first 1398 bytes, causing another retransmission of the last 666 bytes, and so on, until eventually the connection timed out. Rather than a threading issue, this potentially seems like some kind of network-layer problem. However, we can no longer reproduce it, so it's difficult to investigate more deeply.
As @dazinator already suggested our particular problem was probably something entirely different than what this ticket is about. The 258 timeout error is fairly general - it just means that some kind of wait operation timed out and it can have multiple different underlying causes. Sorry for the noise.
We have also experienced the same issue.
The solution, for us, was to limit the packet size in the connection string by appending ;Packet Size=512. We chose 512 as it is the minimum size.
See https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.packetsize?view=dotnet-plat-ext-8.0 for more info on SqlConnection.PacketSize property.
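For reference, the full connection string then looks something like this (server, database and credentials are placeholders; 512 is the minimum allowed packet size):
// Placeholders everywhere except the Packet Size keyword.
const string connectionString =
    "Data Source=tcp:myserver.database.windows.net,1433;Initial Catalog=MyDatabase;" +
    "User ID=myuser;Password=<secret>;Packet Size=512";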
https://www.codeproject.com/Tips/5378079/Challenges-in-Migrating-ASP-NET-Apps-to-Containers
Maybe after all it relates to MARS? (DB connection string: MultipleActiveResultSets=True)
Not in mine or many other cases. The MARS issue was separate as far as I know. I use MultipleActiveResultSets=False and still saw this problem.
Note: After moving the work to a separate dedicated process (and therefore removing contention for the application thread pool in the process) I no longer see this issue. The new process starts on the same machine, runs the same queries, and terminates as expected. Also, not only are we running this logic in a dedicated process, we have 10 jobs to run; rather than them all being run in the application at the same time as before, they are now 10 separate dedicated processes that start and then terminate. So this has resulted in 11 separate processes, each with its own thread pool and using its own connection.
So it seems to be to do with:
I also wondered if Azure SQL has some sort of connection "rejection" mechanism, such that when it scales up or CPU usage is high, it can reject queries with a 258 "immediately", and that perhaps when a connection is in this state, it doesn't clear. A bit like a rate-limiting feature. However, this is pure speculation; I don't have the networking skills to trace what packets are being sent when this occurs, as our apps run in Docker on Linux Azure VMs and things get pretty complicated pretty fast.
(Quoting the original issue description above.)
What I found/tried
https://stackoverflow.com/questions/57270245/sql-server-dbcommand-timeout-with-net-core-container-under-load
https://github.com/StackExchange/Dapper/issues/1435