Closed AfsanehR-zz closed 3 years ago
That's not my PR, its yours so i can't do anything with it.
@Wraith2 We got the PR updated, reviewed and merged (was due on us), but on testing with NuGet package I still see the issue with EF Core test, and the error shows up intermittently.
I guess the heavy and slow solution got me confused and accidentally thinking this issue is fixed with the PR, due to it's intermittent nature. But I guess I'll continue with investigations 😞
It really would be good to get an SqlClient only repro. I'm fairly sure it's a timing issue but without being able to identify where the error starts it's guesswork.
@mshenoy83 Sorry it took a bit longer, here goes my dirty hack. In order to use it, please check the "Custom execution strategy" section of this documentation https://docs.microsoft.com/en-us/ef/core/miscellaneous/connection-resiliency
class CustomExecutionStrategy:
using Brulasoft.Inviton.Data.Internal.SQL; using Microsoft.EntityFrameworkCore; using Microsoft.EntityFrameworkCore.Storage; using System; using System.Data.SqlClient; namespace Brulasoft.Inviton.Data.Internal { /// <summary> /// Custom SQL execution /// </summary> internal class CustomExecutionStrategy : SqlServerRetryingExecutionStrategy { public CustomExecutionStrategy(ExecutionStrategyDependencies dependencies) : base(dependencies) { } protected override bool ShouldRetryOn(Exception exception) { if (base.ShouldRetryOn(exception)) { return true; } if (exception is SqlException sqlException) { if (SqlRecoveryAttemptor.ShouldTryRecover(exception, ExceptionsEncountered.Count) && Dependencies != null && Dependencies.CurrentDbContext != null) { var context = Dependencies.CurrentDbContext.Context; if (context != null) { var sqlConn = context.Database.GetDbConnection() as SqlConnection; if (sqlConn != null) { SqlRecoveryAttemptor.TryRecover(sqlConn, ExceptionsEncountered.Count); return true; } } } } return false; } } }
class SqlRecoveryAttemptor
using System; using System.Collections.Generic; using System.Data.Common; using System.Data.SqlClient; using System.Text; namespace Brulasoft.Inviton.Data.Internal.SQL { public class SqlRecoveryAttemptor { public const int RetryCount = 4; public static bool ShouldTryRecover(Exception ex, int retryAttemptsCount) { return retryAttemptsCount < RetryCount && IsMARSError(ex); } public static bool IsMARSError(Exception ex) { return ex.Message.Contains("MARS TDS header"); } public static void TryRecover(SqlConnection sqlConn, int retryAttemptsCount) { if (retryAttemptsCount > 2) { System.Threading.Thread.Sleep(1500); } if (sqlConn.State != System.Data.ConnectionState.Closed) { TryClose(sqlConn); TryClearPool(sqlConn); TryOpen(sqlConn); } else { TryOpen(sqlConn); TryClose(sqlConn); TryClearPool(sqlConn); TryOpen(sqlConn); TryClose(sqlConn); } } private static void TryClearPool(SqlConnection conn) { try { SqlConnection.ClearPool(conn); } catch (Exception) { } } private static void TryClose(DbConnection conn) { try { conn.Close(); } catch (Exception) { } } private static void TryOpen(DbConnection conn) { try { conn.Open(); } catch (Exception) { } } } }
@brunolau curious why there are the extra Open and Close calls, from the ClearPool
documentation it seems it shouldn't matter, i.e. you could just call ClearPool
. Did you find you needed those extra calls?
@earldean Not at all..they're most likely redundant. As the issue was emerging once in 1-2 days and I was basically doing a "blindfold hack", I've just tried anything that might have had patched the problem for me...to sum it up, no particular reason for the additional Open/Close calls
@brunolau Thanks for the quick reply! I just wanted to double check, I thought you possibly have learned something through trial and error :)
It is quite hard to test, not just the bug of course since it is very non-deterministic, but testing the custom SqlServerRetryingExecutionStrategy
. I suppose I should share my test case I wrote for this.
Thanks!
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Internal;
using Microsoft.EntityFrameworkCore.Storage;
using Moq;
using NUnit.Framework;
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Reflection;
using System.Data.SqlClient;
using System.Runtime.Serialization;
[Test]
public void TestReturnsTrueWhenSqlExceptionHasMarsTdsMessage()
{
var sqlConnection = new SqlConnection({SomeLocalSqlConnString});
var dbContext = new DbContext(new DbContextOptionsBuilder().
UseSqlServer(sqlConnection).Options);
var mockedICurrentDbContext = new Mock<ICurrentDbContext>();
mockedICurrentDbContext.Setup(m => m.Context).Returns(dbContext);
var dependencies = new ExecutionStrategyDependencies(
mockedICurrentDbContext.Object,
new DbContextOptionsBuilder<AlarmLogContext>().Options,
null);
// The MarsTdsExceptionSqlRecoverHack class is the class that extends SqlServerRetryingExecutionStrategy
var po = new PrivateObject(new MarsTdsExceptionSqlRecoverHack(dependencies));
var shouldBeFalse = (bool)po.Invoke("ShouldRetryOn", new Exception("MARS TDS header"));
NUnit.Framework.Assert.IsFalse(shouldBeFalse);
var sqlException = FormatterServices.GetUninitializedObject(typeof(SqlException)) as SqlException;
var messageField = typeof(SqlException).GetField("_message", BindingFlags.NonPublic | BindingFlags.Instance);
messageField.SetValue(sqlException, "MARS TDS header");
// test SQL exception without MARS TDS message
var shouldBeTrue = (bool)po.Invoke("ShouldRetryOn", sqlException);
NUnit.Framework.Assert.IsTrue(shouldBeTrue);
sqlException = FormatterServices.GetUninitializedObject(typeof(SqlException)) as SqlException;
messageField.SetValue(sqlException, "Banana");
shouldBeFalse = (bool)po.Invoke("ShouldRetryOn", sqlException);
NUnit.Framework.Assert.IsFalse(shouldBeFalse);
}
@ajcvickers any chance of getting someone on the EF side to see if they can assist with disentangling what EF is doing to the repro we've got so far so we can investigate this as the driver issue it is?
We are on netcoreapp3.1, EFCore 3.1.2 and experiencing this every other day.
.NET Core SDK (reflecting any global.json): Version: 3.1.101 Commit: b377529961
Runtime Environment: OS Name: Windows OS Version: 10.0.18363 OS Platform: Windows RID: win10-x64 Base Path: C:\Program Files\dotnet\sdk\3.1.101\
Host (useful for support): Version: 3.1.1 Commit: a1388f194c
.NET Core SDKs installed: 2.1.2 [C:\Program Files\dotnet\sdk] 2.1.201 [C:\Program Files\dotnet\sdk] 2.1.202 [C:\Program Files\dotnet\sdk] 2.1.300-rc1-008673 [C:\Program Files\dotnet\sdk] 2.1.300 [C:\Program Files\dotnet\sdk] 2.1.301 [C:\Program Files\dotnet\sdk] 2.1.400 [C:\Program Files\dotnet\sdk] 2.1.401 [C:\Program Files\dotnet\sdk] 2.1.502 [C:\Program Files\dotnet\sdk] 2.1.504 [C:\Program Files\dotnet\sdk] 2.1.508 [C:\Program Files\dotnet\sdk] 2.1.602 [C:\Program Files\dotnet\sdk] 2.1.801 [C:\Program Files\dotnet\sdk] 2.2.104 [C:\Program Files\dotnet\sdk] 2.2.203 [C:\Program Files\dotnet\sdk] 2.2.401 [C:\Program Files\dotnet\sdk] 3.1.101 [C:\Program Files\dotnet\sdk]
.NET Core runtimes installed: Microsoft.AspNetCore.All 2.1.0-rc1-final [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.9 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.2.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.2.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.All 2.2.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All] Microsoft.AspNetCore.App 2.1.0-rc1-final [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.9 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.2.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.2.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 2.2.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 3.1.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.NETCore.App 2.0.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.0.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.0.9 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.0-rc1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.2 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.3-servicing-26724-03 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.8 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.9 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.2.2 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.2.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 2.2.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 3.1.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.WindowsDesktop.App 3.1.1 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
We started observing this problem in production sporadically. After digging in our telemetry, we noticed that the exceptions listed ☝️ were always proceeded by exceptions in other requests (that likely shared the same Sql Connection Pool) that indicated that they had called .Cancel()
on their SqlCommand
in response to their request being canceled. After some experimentation we able to create a sample Console application that appears to reproduce the problem very constantly. The example application can be found here https://github.com/boxofyellow/MarsHeader
Thanks @boxofyellow
We've been eagerly waiting for a repro to address this issue, I'll give this a try soon!
I've been digging into this and wanted to summarize my finding so far, in case someone else is looking, too.
The error occurs because _nonTransactedOpenResultCount in TdsParser.cs goes below 0 in certain situations. _nonTransactedOpenResultCount gets sent in the MARS header data and values < 0 are not valid.
This error was introduced to System.Data.SqlClient on .NET Core in this PR: https://github.com/dotnet/corefx/pull/26595
Which removes stateObj._syncOverAsync = true;
from TryProcessDone() in TdsParser.cs
The issue also exists against .NET Framework if you set this context switch to false (it's true by default): Switch.Microsoft.Data.SqlClient.MakeReadAsyncBlocking
The comment associated with the line above ("// Can't retry TryProcessDone) tells me that TryProcessDone might be completing more than once when it should not be, since at the end of that method is where _nonTransactedOpenResultCount is getting decremented.
We could simply revert the change from the above PR, but that would re-introduce what it was trying to fix. We could also make sure _nonTransactedOpenResultCount simply never gets decremented below 0, but that doesn't seem right, off hand. Unless we can figure out a better solution, we may just have to revert the PR.
How is TryProcesDone being reached multiple times with the same data? It is an async packet list replay? If so the counters may need moving into the snapshot so a replay will only affect take effect once no matter how many times it's replayed.
How is TryProcesDone being reached multiple times with the same data? It is an async packet list replay? If so the counters may need moving into the snapshot so a replay will only affect take effect once no matter how many times it's replayed.
@Wraith2 As I was thinking about it last night, after I commented, it might be that TryProcessDone is only running once. It might be that IncrementAndObtainOpenResultCount is not being called when the driver thinks it has (maybe due to a "perfectly timed" cancel?), thus later trying to decrement the count when it should not. I still need to investigate further.
I had a look through and found that even putting the counter in the snapshot wouldn't help because the transacted version lives on the SqlInternalTransaction object which isn't snapped.
Hey there, we're facing the same situation here: ef core 2.1.3
System.Data.SqlClient.SqlException (0x80131904): The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.
The production app is currently on a heavy traffic situation and the issue appear. It has never happen before... We have added CPU's and RAM, we are on a single production line.
Is there any chance to see a fix on this issue, it appears that it was reported in 2018...
It also started happening for us a few hours ago on our Azure Sql Business Critical. The app is also under heavy load. We are getting hundreds of errors:
Exception: Microsoft.Data.SqlClient.SqlException (0x80131904): The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.
asp.net core 3.1 / windows. Assembly Microsoft.Data.SqlClient, Version=1.0.19269.1
Actually what is the best option to avoid this blocking issue ? Because we're really in a bad situation here....
Will it help if we upgrade to the latest EF Core / Microsoft.Data.SqlClient? We are on ef core 3.1.0 but I see there is a 3.1.3 now,,,
We have done an emergency production restart and it fixed the issue.
We conclude that the process can get into a irreversible failure state (likely caused heavy load) where it has to be restarted. Is that possible?
Not the process, this isn't a corruption bug. If all connections in a pool enter an unrecoverable state and aren't doomed out of the pool then you could end up with every connection in the pool being unusable. In that case you could add code to manually force clear the pool (pretty sure there's public api surface to do this already) which would force creation of all new connections at a minor occasional perf cost.
Shouldn't that be done automatically though? We've never had to deal with this problem and are definitely not keen on meddling with the db pool.
Shouldn't that be done automatically though?
What? an unusable connection be removed from the pool? Sure, but this is a bug so it doesn't do what it's supposed to, that's what bugs are.
We've never had to deal with this problem and are definitely not keen on meddling with the db pool.
As I see if you have 3 options.
Your call.
@clement911 You can use my nasty execution strategy workaround mentioned in the comments above. It still does its job even in EF core 3 (with some slight modifications)
@brunolau could you please share modifications for the workaround related to EF core 3.
{"code":"System.Data.SqlClient.SqlException","message":"The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.","template":"","data":{}} I also encountered this problem, using dapper + NETCORE2.2. Is there any update on this issue?
From that message we can see that you're using System.Data.SqlClient
not Microsoft.Data.SqlClient
, the system version is deprecated in favour of this library, I suggest switching.
@Wraith2 I'll replace it immediately, hoping that the problem can be solved.Thanks.
We were also using the the System.Data.SqlClient
package, I updated the example repro to try the Microsoft.Data.SqlClient
package and the problem appears to be visible there as well.
@jaredhshuai @boxofyellow @nurhat
Our proposed fix in discussion is in PR #490 , would you please give that a try and let us know if that resolves your problem?
I pulled down the nuget package listed in the PR ☝️, it looks like it solved the problem for the Microsoft
package. I'll run a batch of them overnight to see if anything else shakes out. Is there any chance we can expect to see a similar fix for the System
package? We might take this as the catalyst to switch over to the Microsoft
package, but that will dramatically slow down rolling the change out into production.
We shall consider backporting it to System.Data.SqlClient (netcoreapp3.1) if it proves to be promising and we gather confirmation from users from different domains.
cc @David-Engel for final confirmation.
@cheenamalhotra I'm planning to test against wraith's test on release/3.0 and the repro from https://github.com/dotnet/SqlClient/issues/85#issuecomment-601864224. have these tests been run yet? If they pass, would that be enough confidence to merge https://github.com/dotnet/SqlClient/pull/490?
@michaeltnguyen that would be a nice add-on for sure!
I will also be testing this soon, I'm just occupied with other priorities.
From that message we can see that you're using
System.Data.SqlClient
notMicrosoft.Data.SqlClient
, the system version is deprecated in favour of this library, I suggest switching.
@Wraith2 After switching, the issue is still existing.
{"code":"Microsoft.Data.SqlClient.SqlException","message":"The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.","template":"","data":{}}
@cheenamalhotra Useing following package, I had finished load test.
"Microsoft.Data.SqlClient" Version="2.0.0-preview1.20021.1"
ThroughInput=6.8/s When the number of samples is less than 10000, it is stable. After more than 10000, the following error occurred:
{"code":"Microsoft.Data.SqlClient.SqlException","message":"The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.","template":"","data":{}}
The final number of samples is 50000, with an error rate of 45%.
@jaredhshuai is that the package linked to in this comment https://github.com/dotnet/SqlClient/pull/490#issuecomment-604634276 ?
@Wraith2
When I use this below version,
<PackageReference Include="Microsoft.Data.SqlClient" Version="2.0.0-dev-pr490" />
Visual Studio said it couldn't be found, and recommended the following version for me.
<PackageReference Include="Microsoft.Data.SqlClient" Version="2.0.0-preview1.20021.1" />
You'll have to download the nuget package from the link in that comment and manually add it to the nuget cache folder so it can be resolved. I use a local nuget food for this.
I need deployed to QA environment to test. In that case, the package will not be found in QA environment.
Put the file in a folder and then add a nuget config file with a nuget package source pointing at that folder, https://docs.microsoft.com/en-us/nuget/reference/nuget-config-file#package-source-sections
Repetitively running the sample application over night (about 1500 runs) with the package from #490, did not yield any MARDS TDS header exceptions.
When I'm trying to run the pr build against Wraith2's test on release/3.0, I'm getting the following error:
NU1202 Package Microsoft.Data.SqlClient 2.0.0-preview2.20084.1 is not compatible with netstandard1.3
I also tried ONLY updating the version of SqlClient on EfCore.SqlServer.FunctionalTests, but the BreakTdsStream
test was still failing. (I'm unsure if it's sufficient to only update SqlClient on that single project, though)
You also need to update versions.props
Hi @ErikEJ , are you saying that I need to update version.props somewhere or that the PR author needs to update version.props in the nuget package? If it's me, could you point me to some basic instructions of what I'd need to do? (I tried some basic googling, but hit diminishing returns very quickly)
I think maybe I misunderstood. You are not using the EF Core source code?
@ErikEJ I am using the EF Core source code. I added Wraith2's breaking test on the release/3.0, then I'm using the PR build of Microsoft.Data.SqlClient to see if it fixes the breaking test.
Then have a look at the versions.props file in the eng folder
When I use the nuget from the PR build, I'm now unable to resolve Microsoft.Data.SqlClient
@cheenamalhotra any thoughts? am I just doing something wrong here?
Have you set up s local feed and added it to NuGet.config ?
When I use the nuget from the PR build, I'm now unable to resolve
Microsoft.Data.SqlClient
@cheenamalhotra any thoughts? am I just doing something wrong here?
Are you not able to restore the package at all or you get an exception? Check your NuGet.config and logs if the location is being considered when restoring.
Copied from here:
I have an ASP.NET Core app [Microsoft.AspNetCore.App 2.1.4] with EF Core 2.1.4 [DbContext pooling enabled] and data stored on an Azure SQL database.
Occasionally [once in 1-2 days] I get unrecoverable error stating System.Data.SqlClient.SqlException (0x80131904): The incoming tabular data stream (TDS) protocol stream is incorrect. The MARS TDS header contained errors.
Once I restart the app, everything resumes working until it happens again.
Probably import to note is that I don't have MultipleActiveResultSets enabled in my connection string at all, which makes the issue even more strange.
Has anyone else noticed something similar? Are there any ways I can trace the problem?
Stack trace: