dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License

Execution Timeout Expired Error (258, ReadSniSyncOverAsync) #647

Closed: frankyuan closed this issue 3 months ago

frankyuan commented 4 years ago

Describe the bug

When executing SQL such as SELECT FieldA, FieldB FROM A INNER JOIN C ON A.FieldId = C.FieldId UNION SELECT FieldA, FieldD FROM A INNER JOIN D ON A.FieldId = D.FieldId, the error below is thrown. It does not happen every time; only a small fraction of queries hit it.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByteArray(Span`1 buff, Int32 len, Int32& totalRead)
   at Microsoft.Data.SqlClient.TdsParser.TrySkipValue(SqlMetaDataPriv md, Int32 columnOrdinal, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TrySkipRow(SqlMetaDataSet columns, Int32 startCol, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
   at Microsoft.Data.SqlClient.SqlDataReader.ReadAsync(CancellationToken cancellationToken)
--- End of stack trace from previous location where exception was thrown ---
   at Dapper.SqlMapper.QueryAsync[T](IDbConnection cnn, Type effectiveType, CommandDefinition command) in /_/Dapper/SqlMapper.Async.cs:line 437

To reproduce

Sorry, I currently can't reproduce this in a local environment, so I can't provide more detailed reproduction steps.

Expected behavior

SQL should execute successfully every time.

Further technical details

Microsoft.Data.SqlClient version: 1.1.3
.NET target: .NET Core 3.1
Operating system: Docker container

What I found/tried

https://stackoverflow.com/questions/57270245/sql-server-dbcommand-timeout-with-net-core-container-under-load

https://github.com/StackExchange/Dapper/issues/1435

jslaybaugh commented 3 years ago

We're having similar issues. It happened out of the blue last week and then again this morning. Suddenly these timeouts start piling up and our DTUs spike to 100% and stay there.

Unfortunately, the only thing that seems to fix this for us is manually scaling up our tier in SQL Azure. We generally run at a P4, so we simply bump it to the next one (P6) and then it's fine, even when we immediately bump it back down to the P4. It is like something just gets "stuck" and there is no way to remedy it without scaling the SQL server. Which sucks, because who knows when it is going to happen. Here's a sample stack trace:

Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
at Microsoft.Data.SqlClient.SqlCommand.<>c.b__188_0(Task`1 result)
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
--- End of stack trace from previous location where exception was thrown ---
at Dapper.SqlMapper.QueryMultipleAsync(IDbConnection cnn, CommandDefinition command) in /_/Dapper/SqlMapper.Async.cs:line 1013
at Data.BaseDB.QueryMultipleAsync(String sql, Object param, Boolean readReplica, Nullable`1 commandTimeout) in /src/src/Data/BaseDB.cs:line 77
at Data.ReportsDB.PaymentMethodBreakdownAsync(String clientKey, DateTime startDate, DateTime endDate, List`1 events, List`1 paymentMethods) in /src/src/Data/ReportsDB.cs:line 728

Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Unknown error 258

Here's an example of the DTUs: everything is trucking along normally, then it starts to go nuts, and around the time in the morning that business starts picking up we're at 100% until we finally bounce the tier, after which it more or less goes back to normal 😖

[Screenshot: DTU usage chart, 2021-03-16 10:55 AM]

Also opened a ticket with Azure support to see if they can help diagnose from their end.

brjasha commented 3 years ago

I have found the reason in my case. I'm sorry, but it was a number of badly performing queries - doing either index or table scans over several tens of millions of rows. It was really hard to catch, as they happened 1-2 times per hour and did not noticeably affect memory (30% free) or CPU (max 20% usage). Those scans were overloading the disk subsystem and harming SQL Server's caches. So even after the query had run there were problems for the next 5-10 minutes: because data was missing from the cache, even usually well-performing queries had to read too much from disk and therefore timed out.

Guys, check your queries. This SQL script gives you the ones that had to read a lot of data (the script was cut off when posted; a reconstructed sketch follows): `SELECT SUBSTRING(qt.TEXT, (qs.statement_start_offset/2)+1, ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(qt.TEXT) ELSE qs.statement_end_offset END`
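
A reconstructed sketch of that kind of check - the statement text plus read counts from sys.dm_exec_query_stats - run through SqlClient. The column list, the TOP 20 and the connection string are illustrative, not taken from the original comment:

// Lists the statements that performed the most logical reads since their plans were cached.
using System;
using Microsoft.Data.SqlClient;

string connectionString = "<your connection string>";
const string topReadsSql = @"
SELECT TOP 20
       SUBSTRING(qt.text, (qs.statement_start_offset / 2) + 1,
           ((CASE qs.statement_end_offset
                 WHEN -1 THEN DATALENGTH(qt.text)
                 ELSE qs.statement_end_offset
             END - qs.statement_start_offset) / 2) + 1) AS statement_text,
       qs.total_logical_reads,
       qs.execution_count
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
ORDER BY qs.total_logical_reads DESC;";

using var connection = new SqlConnection(connectionString);
connection.Open();
using var command = new SqlCommand(topReadsSql, connection);
using var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine($"{reader.GetInt64(1),15:N0} logical reads  {reader.GetString(0)}");
}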

JRahnama commented 3 years ago

@brjasha thanks for the update.

cheenamalhotra commented 3 years ago

That's a great point you brought up @brjasha

It's totally possible that the queries themselves are not designed well and take a long time to execute on the server side. Sometimes it's due to messed-up indexing, queries with a very complex execution plan, or SQL caching issues as in this case. Optimizing queries can solve the majority of these issues.

So, unless the problem is triggered by thread pool starvation (when thousands of connections are querying in parallel and the client is not able to read all results in time - a known issue, #422) and queries are also timing out in normal executions, they probably need optimization. An Azure/SQL DB administrator should be able to address those kinds of issues.
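
A quick way to tell whether thread pool starvation is in play is to sample the pool from inside the application; a minimal sketch (the 200 floor is an arbitrary example, not a recommendation):

using System;
using System.Threading;

ThreadPool.GetMinThreads(out int minWorker, out int minIo);
ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIo);
Console.WriteLine($"Worker threads busy: {maxWorker - freeWorker} (min {minWorker}, max {maxWorker})");

// Possible mitigation while investigating: let the pool grow faster under bursty async load.
ThreadPool.SetMinThreads(Math.Max(minWorker, 200), minIo);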

andrehaupt commented 3 years ago

Translated by Google: I changed MultipleActiveResultSets in the DB connection to false, and the problem disappeared completely

@niuniu007 Yes this issue exists for MARS enabled connections only.

We are experiencing this on connections with MARS disabled.

Will try the new MDS client.
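
For anyone comparing runs with and without MARS, this is the connection-string switch involved; a sketch with placeholder server and database names:

using Microsoft.Data.SqlClient;

var builder = new SqlConnectionStringBuilder("Server=tcp:myserver.database.windows.net,1433;Initial Catalog=mydb;")
{
    MultipleActiveResultSets = false, // MARS off; set to true to compare against the MARS-enabled behaviour
    ConnectTimeout = 30
};
using var connection = new SqlConnection(builder.ConnectionString);
connection.Open();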

andrehaupt commented 3 years ago

@JRahnama Do you know if the same issue that was fixed in MDS 2.1.2 was present in 1.1.x? We ran 1.1.2-4 and encountered the same problem.

JRahnama commented 3 years ago

@andrehaupt the issue was carried into MDS from the beginning. It was related to an internal timer problem that caused wrong results in async calls. That issue existed in the 1.x versions and only got fixed in the very latest releases of the driver (v2.1.2 and v1.1.4).

cheenamalhotra commented 3 years ago

@andrehaupt

What difference in behavior do you see between v1.1.4 and v2.1.2? The wrong-data issue (#659) was a critical one and was fixed in both versions, so I'm not sure you're looking at the same issue for validation.

Could you explain more and provide a repro to your case if possible?

andrehaupt commented 3 years ago

@cheenamalhotra Will monitor 1.1.4 in prod tomorrow and revert with more detail. Loads are low at the moment, so not experiencing any issues. Yes, I can share a private repo.

andrehaupt commented 3 years ago

@cheenamalhotra Unfortunately, this happened again for us in 1.1.4. It happens at random calls, ranging from SELECT 1 statements to complex queries. I shall share the repo, cap file and Azure support thread in a DM.

The stack trace is consistent though:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at Microsoft.Data.SqlClient.SqlConnection.BeginDbTransaction(IsolationLevel isolationLevel)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransaction(IsolationLevel isolationLevel)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
jslaybaugh commented 3 years ago

(Quoting @andrehaupt's comment above, including its stack trace.)

@andrehaupt We're also working with Azure and our issues seem to be getting worse. We used to be able to fix this by scaling our Azure SQL up/down and it would go away. Now we are scaled WAY up and we're still having issues and we're unable to track anything down. Like you say, 258 seems to affect simple statements and complex queries alike.

If you are willing, and think it might have helpful information, could you DM me your Azure support thread as well?

cheenamalhotra commented 3 years ago

Please provide us a repro by mocking your table schemas, in order to reproduce. This error can range due to various reasons, sometimes due to server side processing delays as explained here: https://github.com/dotnet/SqlClient/issues/647#issuecomment-801292170

If it's confirmed this is a client side delay, we'd need a repro app to investigate further.

andrehaupt commented 3 years ago

@jslaybaugh / @cheenamalhotra Happy to share, but I've noticed that we are encountering this both in the System.Data.SqlClient and Microsoft.Data.SqlClient libraries. We are still troubleshooting with Azure Support. They are not able to see any issues in network traces we've sent, but regardless, I believe the issue to be environmental.

andrehaupt commented 3 years ago

@jslaybaugh / @cheenamalhotra: Quick Update:

The issue is environmental.

If you SSH onto the AKS node and have a look at /var/log/syslog (or syslog.1), you should see "eth0: Lost carrier" messages. It appears to be related to this Azure issue: https://github.com/Azure/aks-engine/issues/4341. A fix is being rolled out at the moment.

uvw commented 3 years ago

@andrehaupt

We stumbled on a very similar issue on AKS a few days ago and reached out to Azure support. It seems environmental to me too, but Azure support said that they are not rolling out any AKS fixes for this right now. They said that https://github.com/Azure/aks-engine/issues/4341 is not related to AKS because it is about AKS-Engine.

I hope you are right though.

As a temporary workaround, we moved the affected workloads to Windows containers and it helped.

andrehaupt commented 3 years ago

@uvw We haven't had any issues in the past day, so that was definitely the issue.

luarvic commented 3 years ago

Hi there. I am experiencing the same issue. My containerized .NET Core (3.1) application runs on Linux and uses Microsoft.Data.SqlClient (2.1.2) to query on-premises SQL Servers. Under heavy load (multiple simultaneous connections) the application throws the following errors:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__164_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---

The SQL query is simple and lightweight, so I don't think it's a matter of bad SQL Server performance.

andrehaupt commented 3 years ago

@Rosti81 It does not seem to be the same issue as the ReadSniSyncOverAsync call isn't in your stack trace.

One issue that has been resolved in version 2.1.4 is the location where error 258 pops up. It might be that other queries are timing out and that it is reported at this location in the code. Try to upgrade and see if the error persists.

luarvic commented 3 years ago

Hi @andrehaupt Has version 2.1.4 been released? I cannot see it in the releases. The last stable one is 2.1.2.

andrehaupt commented 3 years ago

My bad. Confused it with 1.1.4. You are on the latest then. Apologies.

luarvic commented 3 years ago

I'm going to try the latest prerelease, which includes the #920 fix that resolves the similar issue #672. I'll update you on the result.

DirkBoer commented 3 years ago

Hi @Rosti81 - I'm having the same problems right now - also on a server with quite a heavy load - do you still have these problems?

andrehaupt commented 3 years ago

@DirkBoer It was purely network related in our case. We added a monitoring job that runs every 30 seconds to help us detect network issues.

Give this bit of code a bash and see if this picks up any issues:

// Requires: using System; using System.Linq; using System.Net; using System.Net.Sockets;
private const int ConnectTimeout = 10000;  // 10 seconds

private void OpenRemoteHostPort(string hostName, int hostPort, string hostDescription)
{
    try
    {
        using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        {
            // Resolve the host name first so DNS problems are surfaced separately.
            IPAddress[] remoteHosts = Dns.GetHostAddresses(hostName);

            if (remoteHosts.Length < 1)
                throw new Exception("Unable to get remote host IP address");

            // Attempt a plain TCP connect and wait at most ConnectTimeout for it to complete.
            IAsyncResult result = socket.BeginConnect(remoteHosts.First(), hostPort, null, null);
            bool success = result.AsyncWaitHandle.WaitOne(ConnectTimeout, true);

            if (!success)
            {
                socket.Close();
                throw new SocketException(10060); // Connection timed out
            }

            socket.EndConnect(result);
        }
    }
    catch (Exception ex)
    {
        Logger.Error($"Unable to establish a socket connection to {hostDescription} ({hostName}:{hostPort})", ex);
    }
}
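
On newer runtimes the same probe can be written without the Begin/End pattern; a sketch along the same lines (the method name and timeout handling are illustrative):

// Requires: using System; using System.Net.Sockets; using System.Threading.Tasks;
private static async Task ProbeSqlEndpointAsync(string hostName, int hostPort, TimeSpan timeout)
{
    using var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    Task connectTask = socket.ConnectAsync(hostName, hostPort);

    if (await Task.WhenAny(connectTask, Task.Delay(timeout)) != connectTask)
        throw new SocketException(10060); // connection timed out

    await connectTask; // surfaces DNS or connect failures, if any
}
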
DirkBoer commented 3 years ago

Hi @andrehaupt - what kind of network issues? I'm running almost everything in Azure and the majority of connections are to SQL Azure - all out of the box. What kind of host and port were you testing here? Did you have timeouts to other services that depleted something like an outbound connection pool?

andrehaupt commented 3 years ago

@DirkBoer - We also had a false sense of security that everything out of the box would just work. See this issue, where the VMs in our k8s cluster had intermittent DHCP issues: https://github.com/Azure/aks-engine/issues/4341. This has now been resolved.

We thought the problem was related to load, but in fact the problem was there regardless of load. It just popped up in the logs more often as the hit rate was higher with more activity going on in the system during peak hours.

The host is simply your Azure SQL Server host name and the port is typically 1433.

From experience, it is quite unlikely that the connection pools are to blame, unless you tweak those yourself or you have an extremely busy system. The key for us was to look carefully at the full stack trace. Error 258 gets thrown for a bunch of reasons, so one has to dig a bit deeper. The following frame indicated that there was an issue in the SNI (SQL Server Network Interface) layer, caused by the DHCP resolution problem:

Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()

If you could paste your stack trace here, that might be helpful in finding the issue.

DirkBoer commented 3 years ago

@andrehaupt - see below (I removed my application logic).

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__164_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
   at Microsoft.EntityFrameworkCore.Query.ShapedQueryCompilingExpressionVisitor.SingleOrDefaultAsync[TSource](IAsyncEnumerable`1 asyncEnumerable, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.ShapedQueryCompilingExpressionVisitor.SingleOrDefaultAsync[TSource](IAsyncEnumerable`1 asyncEnumerable, CancellationToken cancellationToken)

But we also saw weirder ones during that period, like this:


System.InvalidOperationException: The underlying reader doesn't have as many fields as expected.
   at Microsoft.EntityFrameworkCore.Query.Internal.BufferedDataReader.BufferedDataRecord.InitializeFields()
   at Microsoft.EntityFrameworkCore.Query.Internal.BufferedDataReader.BufferedDataRecord.InitializeAsync(DbDataReader reader, IReadOnlyList`1 columns, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.BufferedDataReader.InitializeAsync(IReadOnlyList`1 columns, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
   at Microsoft.EntityFrameworkCore.Query.ShapedQueryCompilingExpressionVisitor.SingleOrDefaultAsync[TSource](IAsyncEnumerable`1 asyncEnumerable, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.ShapedQueryCompilingExpressionVisitor.SingleOrDefaultAsync[TSource](IAsyncEnumerable`1 asyncEnumerable, CancellationToken cancellationToken)
   at Fork.Extensions.IIdExtensions.ByIdAsync[TEntity](IQueryable`1 items, Int32 id) in C:\Repo\Main\Main\SharedCore\Fork.Sql\Extensions\IId.cs:line 151

This is code that hasn't been touched for a year and now is suddenly working perfectly well as expected. Could it be that something was happening to Azure SQL Server during this time? The weird thing was that other services on the same server did seem to still be available.

I have a website that will hit around 20,000 concurrent users during the Euro Cup (in two weeks!), so this does worry me.

andrehaupt commented 3 years ago

@DirkBoer Not sure why this is happening, but it is definitely not related to this thread as the issue here was the SNI stuff.

It might be version related. See this: https://stackoverflow.com/questions/63077018/azure-sql-single-database-dtu-timeout-expired-error

It might also just be a legitimate timeout. Check that the SQL server has enough resources and whether specific queries are genuinely timing out. Also validate your command timeout. We set it up like this:

public void ConfigureServices(IServiceCollection services)
{
    services.AddDbContext<MyDbContext>(options =>
        options.UseSqlServer(_appConfiguration.GetConnectionString(MyConnectionStringName),
            sqlServerOptionsAction =>
            {
                sqlServerOptionsAction.CommandTimeout(CommandTimeout);  // in seconds
                sqlServerOptionsAction.EnableRetryOnFailure();          // transient-fault retries
            }
        )
    );
}
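
Since several reports in this thread go through Dapper rather than EF Core, the equivalent per-call knob there is the commandTimeout argument; a sketch with illustrative values:

using Dapper;
using Microsoft.Data.SqlClient;

string connectionString = "<your connection string>";
using var connection = new SqlConnection(connectionString);
var rows = await connection.QueryAsync(
    "SELECT FieldA, FieldB FROM A",
    commandTimeout: 120); // seconds; when omitted, the SqlCommand default of 30 applies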

Other than this, I'm not sure what is going on.

cheenamalhotra commented 3 years ago

@DirkBoer

Have you tested with v2.1.3 or v3.0.0-preview3? This error should not occur IMO with new versions.

System.InvalidOperationException: The underlying reader doesn't have as many fields as expected.
DirkBoer commented 3 years ago

Hi @cheenamalhotra,

I added v2.1.3 to my project but it is still happening for me :( How can I know for sure it is being used properly? I quite recently moved to .NET Core, so I have to get used to the (thankfully very simplified) package management.

No Microsoft.Data.SqlClient package was installed before, though - so I assume it was coming from some central place or through another package dependency? How can I know it is now being used by Entity Framework and Dapper?

DirkBoer commented 3 years ago

Hi @cheenamalhotra, as I have a really strict deadline I just panic-upgraded my whole project to .NET 5 - do you know if this should also be fixed in .NET 5?

cheenamalhotra commented 3 years ago

@DirkBoer if it occurs with v2.1.3 as well, upgrading to .NET 5 may not help.

Please open a new issue and provide detailed information like connection string, environment details, APIs used and their sequence for us to investigate.

If possible, please wrap up a repro for the issue you're seeing so we can reproduce at our side as well to speed up investigations.

DirkBoer commented 3 years ago

Hi @cheenamalhotra, thanks for your quick reply! As this is a really large project and it only happens during high (natural) traffic, I have no time to try to isolate this into a smaller test sample.

Do you know if this error ever occurred with the Windows Azure version? Otherwise I might switch back to Windows as an emergency measure. I have a project for EURO 2020 that starts next week, so I'm on quite a tight deadline 😅

DirkBoer commented 3 years ago

Hi @cheenamalhotra, sorry for the confusion.

The "The underlying reader doesn't have as many fields as expected" error I actually haven't seen with the new version - but I still do see the error 258 ReadSniSyncOverAsync.

Now I suspect there may have been a short spike in SQL traffic at a certain point - but it feels as though (even after the database becomes accessible again) the SqlConnections stay stuck in this error 258 state forever.

PenzinAlexander commented 3 years ago

Hi guys. Any news on this issue? We are running a .NET application in Docker against MS SQL Server 2019. When the load increases, we see a similar error. We tried upgrading to version 1.1.4 (if I'm not mistaken), but the error still persists. Will this bug be fixed in the next versions, or is it worth waiting for the release of .NET 5?

Thanks. Appreciate your answers

bennil commented 3 years ago

In my case it looks like the problem occurs if other Socket.ReadAsync() calls cannot spin fast enough to clear the receive buffers on Linux. On Windows, if the socket receive buffer is full, the TCP ACK packet is not sent back to the transmitter and the data flow is slowed. On Linux, if the socket receive buffer is full, the connection is closed with the misleading message "was forcibly closed by the remote host".

The following blog post revealed it to me: https://www.codeproject.com/Articles/5307011/Stop-Saving-to-Disk-with-Csharp-Asynchronous-Strea

I added a Channel<T> queue in my receiver task to let Socket.ReadAsync spin freely. Afterwards I did not see any more random TCP connection losses in my communication infrastructure (I do not use HTTP), and also no more SQL Server timeouts. Increasing the receive buffer size might also help as a workaround.
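
A sketch of the shape being described (not the poster's actual code): one loop drains the socket into an unbounded Channel<T> as fast as data arrives, so the kernel receive buffer never fills, and a separate consumer does the slow work:

using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

internal static class SocketPump
{
    // Drains the socket into the channel as fast as data arrives.
    public static async Task ReceiveLoopAsync(Socket socket, ChannelWriter<byte[]> writer)
    {
        var buffer = new byte[64 * 1024];
        while (true)
        {
            int read = await socket.ReceiveAsync(buffer.AsMemory(), SocketFlags.None, CancellationToken.None);
            if (read == 0) break; // remote side closed the connection
            await writer.WriteAsync(buffer.AsSpan(0, read).ToArray()); // never blocks on an unbounded channel
        }
        writer.Complete();
    }

    // Slow processing happens here without stalling the socket reads above.
    public static async Task ConsumeAsync(ChannelReader<byte[]> reader)
    {
        await foreach (byte[] chunk in reader.ReadAllAsync())
        {
            // handle chunk
        }
    }
}

// Usage: var channel = Channel.CreateUnbounded<byte[]>();
//        _ = SocketPump.ReceiveLoopAsync(socket, channel.Writer);
//        await SocketPump.ConsumeAsync(channel.Reader);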

Disclaimer: My insight is quite new. I need more tests and log analysis over time to verify my assertions.

cheenamalhotra commented 3 years ago

Hi @PenzinAlexander @bennil

I'd request you to please identify if your issue is related to #422 (WIP) or #659 (fixed in patched drivers) instead.

If not, and your error is easily reproducible, please re-evaluate your queries and server indexes in standalone repro apps. I would also highly recommend testing your case with the latest driver, Microsoft.Data.SqlClient v3.0.0, to benefit from fixes already made. If queries genuinely do not complete in time due to large amounts of row data being read, you may want to modify the queries or increase "CommandTimeout" if required.
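
For reference, CommandTimeout is set per command; a minimal sketch with illustrative values:

using Microsoft.Data.SqlClient;

string connectionString = "<your connection string>";
using var connection = new SqlConnection(connectionString);
connection.Open();
using var command = new SqlCommand("SELECT FieldA, FieldB FROM A", connection)
{
    CommandTimeout = 120 // seconds; the default is 30, and 0 means wait indefinitely
};
using var reader = command.ExecuteReader();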

Please do not hesitate to open a new issue with a repro if it turns out a different issue you're facing that we should investigate.

Will-Bill commented 2 years ago

Is this happening on the Azure platform for everyone? We're getting the same issue with and without MARS: Linux App Service (Docker) + Azure SQL.

Related: https://github.com/Azure/aks-engine/issues/4341 ?

matiasjuliobello commented 2 years ago

Same here. Getting the "unknown error 258" intermittently (but quite often).

Technical details:

What I've tried:

Nothing worked.

Just to mention this (in case it helps): if I run the same container on my local machine and execute the code, it works like a charm - that being my code opening a connection straight against the DB engine (connected through a VPN to reach the SQL Server 2012 in a VM). The big difference is that the app in the production environment is behind an Application Gateway and API Management, and, in the end, there's Istio routing requests there. If a request goes through that path, it sometimes ends OK and sometimes we get the SQL exception with the error above. I've tried hitting the Pod directly via its IP to be as straightforward as possible, but got the same results (my guess is that Istio is messing it up, but that is just a hunch).

quanmaiepi commented 2 years ago

Just want to add that we significantly reduced the chance of running into this problem by not using MARS. Previously we had the same issue with or without MARS, but it turned out that we were using Key Vault and the secrets are cached for 24 hours. Only adding a dummy change to appSettings and saving it would clear the cache.

cheenamalhotra commented 2 years ago

@matiasjuliobello

Could you share your app code (minimal repro) to verify the use case and assist?

niuniu007 commented 2 years ago

(Translated from Chinese:) I once observed that this error was reported in large numbers after the server CPU reached 100% usage.

cheenamalhotra commented 2 years ago

@niuniu007

[Translated text]

This error was reported in large numbers after the server CPU was observed at 100% usage.

We have noticed the server side behavior in case of Bulk Copy, could you please confirm using Bulk Copy operations triggers this?

niuniu007 commented 2 years ago

(Quoting @cheenamalhotra's comment above.)

(Translated from Chinese:) No - it's because I used an UPDATE that locked the entire table, which is what drove the CPU to 100%.

cheenamalhotra commented 2 years ago

[Translated text]

No, it's because I used UPDATE to lock the entire table, which drove the CPU to 100%.

Technically it's the same issue, as the use case we have seen recently also involves a locked table. And yes, we have seen this issue with SELECT and DML statements as well. When the table is locked and a bulk copy/SELECT is performed, we see that server-side memory consumption increases and connections are not released by the server. The Bulk Copy issue is due to a server-side bug and is being investigated, while SELECT/DML is yet to be confirmed.

As a workaround, you can bump up the server's max worker threads, which will allow other pending tasks to run and let connections that should be closed actually close.
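
A sketch of that server-side knob for on-premises or managed-instance servers (it requires ALTER SETTINGS permission; 1024 is only an example, and 0 restores the self-configured default):

using Microsoft.Data.SqlClient;

string adminConnectionString = "<admin connection string>";
using var connection = new SqlConnection(adminConnectionString);
connection.Open();
using var command = new SqlCommand(@"
    EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
    EXEC sp_configure 'max worker threads', 1024;  RECONFIGURE;", connection);
command.ExecuteNonQuery();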

niuniu007 commented 2 years ago

(Quoting @cheenamalhotra's comment above.)

(Translated from Chinese:) Sorry, my guess was wrong - this has nothing to do with the CPU being at 100%. I reproduced this error in pre-production. Reproduction steps: a table with 700,000 records, where every request updates the entire table (one of its columns); load testing with 1,000 users reproduced the error easily.

(Translated from Chinese:) See the attached screenshot.

linuxesabz commented 2 years ago

Hi guys. We are running a .NET 5 application in Docker against MS SQL Server 2019. When the load increases, we see a similar error. We noticed this problem when running the application on Docker and Linux, but on Windows everything is OK.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__169_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() in /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs:line 190
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 215
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs:line 2352
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
   at Microsoft.EntityFrameworkCore.EntityFrameworkQueryableExtensions.ToListAsync[TSource](IQueryable`1 source, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.EntityFrameworkQueryableExtensions.ToListAsync[TSource](IQueryable`1 source, CancellationToken cancellationToken)
   at IFS.Red.Services.AlertService.ReadOneHundredOfScheduledAlerts() in C:\agent4\_work\7\s\RedSolution\IFS.Red.Services\AlertService.cs:line 209
   at IFS.Red.Services.AlertService.ProcessScheduledAlert() in C:\agent4\_work\7\s\RedSolution\IFS.Red.Services\AlertService.cs:line 189
   at IFS.Red.Jobs.ProcessScheduledAlertJob.Run() in C:\agent4\_work\7\s\RedSolution\IFS.Red.Jobs\ProcessScheduledAlertJob.cs:line 22
ClientConnectionId:32c52d45-9e41-4c5a-9da2-d7306d841e84
Error Number:-2,State:0,Class:11
Will-Bill commented 2 years ago

I am now getting this error on Windows as well as Linux - however, on Windows it happens much less, it seems.

.NET 5 connecting to Azure SQL database

JRahnama commented 2 years ago

@Will-Bill we are eager to see a standalone application that can reproduce the issue, as this issue may come from many different sources. Will you be able to make one? Can anyone else provide us a repro? We are not able to reproduce the issue, hence we are stuck there. We would love to have a repro and make a fix, if the issue is related to the SqlClient driver.

On another note, have you tried increasing the SqlCommand timeout?

Will-Bill commented 2 years ago

@JRahnama the problem we have is that it seems random. When it does happen, it seems to lock up the application and Application Insights stops reporting any data, so during that period the only logs we have are the console logs, which report the error mentioned.

I have no idea how to replicate the issue. Working blind means we are not going to get anywhere fast, which I appreciate. Do you have any suggestions for what to do in Azure App Service to get more information about the causal factors leading to the error? That might help us reproduce the problem.

The application this happens in is a monolith with lots of moving parts. If it were a simple application with few moving parts, it would be easier to start looking. Right now it is a needle in a haystack.

We have tried changing the connection string timeout settings.

JRahnama commented 2 years ago

@Will-Bill another option would be collecting EventSource traces from Microsoft.Data.SqlClient. In this case, since we do not know when it may happen, PerfView won't help much in my opinion. The best option is writing traces to a file by adding a listener to your application. I am not sure about the performance impact, and the file may also grow huge as we gather it over a couple of days. Will you be able to do so and send us the logs?
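
For anyone who wants to try that, a minimal sketch of such a listener; the event source name follows the SqlClient tracing documentation, while the file path, keyword mask and formatting are illustrative:

using System;
using System.Diagnostics.Tracing;
using System.IO;

internal sealed class SqlClientTraceListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Only enable events coming from the SqlClient driver.
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
            EnableEvents(eventSource, EventLevel.Informational, EventKeywords.All);
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string payload = eventData.Payload == null ? "" : string.Join(" | ", eventData.Payload);
        File.AppendAllText("sqlclient-trace.log",
            $"{DateTime.UtcNow:o} {eventData.EventName}: {payload}{Environment.NewLine}");
    }
}

// Keep one instance alive for the lifetime of the process, e.g. as a static field:
// private static readonly SqlClientTraceListener Listener = new SqlClientTraceListener();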