dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
856 stars 286 forks source link

Errors after upgrading to 5.2.0 from 5.1.5 on Linux #2378

Closed alex-jitbit closed 3 months ago

alex-jitbit commented 8 months ago

After upgrading 5.1.5 to 5.2.0 on Linux (Ubuntu) .NET 8 I'm getting thousands of errors:

Exception message:

Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.TimeoutException: The socket couldn't connect during the expected 14965 remaining time.

Stack trace:

Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.TimeoutException: The socket couldn't connect during the expected 14965 remaining time.
   at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Connect(String serverName, Int32 port, TimeoutTimer timeout, SqlConnectionIPAddressPreference ipPreference, String cachedFQDN, SQLDNSInfo& pendingDNSInfo)
   at Microsoft.Data.SqlClient.SNI.SNITCPHandle..ctor(String serverName, Int32 port, TimeoutTimer timeout, Boolean parallel, SqlConnectionIPAddressPreference ipPreference, String cachedFQDN, SQLDNSInfo& pendingDNSInfo, Boolean tlsFirst, String hostNameInCertificate, String serverCertificateFilename)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.Open()

Some requests work fine, but about 50% throw this error. Reverting back to 5.1.5 solves the problem.

Further technical details

Microsoft.Data.SqlClient version: 5.2 .NET target: NET 8 SQL Server version: SQL 2017 on Linux Operating system: Ubuntu 22

apxltd commented 4 months ago

@JRahnama hmmm okay okay.... maybe switch to Chat GPT instead of Bing AI 😉

Here's what it suggested when I prompted "An annoyed customer responded with this. Please help me craft a short responses."

Thank you for bringing this to our attention. I apologize for the frustration and inconvenience you’ve experienced. Your feedback is invaluable, and I want to assure you that we take these issues very seriously.

We understand the significant investment you’ve made in SQL Server, and it’s clear that the recent changes have negatively impacted your operations. This is not the level of service we aim to provide, and we are committed to resolving this issue promptly.

Our team is currently investigating the root cause of the problem and will work diligently to find a solution. We will keep you updated with our progress and ensure that any changes are thoroughly tested moving forward.

Your patience and understanding are greatly appreciated. If you have any further concerns or require immediate assistance, please do not hesitate to reach out.

You see, a high-quality GPT response like that gives you the impression that you aren't being passive-aggressive. And it almost gives the impression that you care about the problems you have caused for your customers.

PaulVrugt commented 4 months ago

So I'm trying to find the most suitable workaround for our application that works properly in production.

am I to understand that if I increase the connect timeout parameter the issue does not occur? Will it then still take a lot of time to connect each time. Or does it only take a lot of time until enough threads are created to handle all connections, and will it then be fast until the application is restarted?

I also don't really understand the resistance of setting min threads. If apparently your application needs that amount of threads in production workloads, what is the issue in setting it so it starts with enough threads?

Nb: let's please try to keep this discussion productive. I too would like to stress this is a pressing issue for us, but let's not make this a toxic discussion

JRahnama commented 4 months ago

@alex-jitbit if increasing the ConnectTimeout is not a preferred solution, I may have a solution for this issue until SqlClientX is out for good. Give me some time to test my proposed solution and discuss that internally with other team members, in case of positive outcome I will give you another nuget to test with.

JRahnama commented 4 months ago

So I'm trying to find the most suitable workaround for our application that works properly in production.

am I to understand that if I increase the connect timeout parameter the issue does not occur? Will it then still take a lot of time to connect each time. Or does it only take a lot of time until enough threads are created to handle all connections, and will it then be fast until the application is restarted?

I also don't really understand the resistance of setting min threads. If apparently your application needs that amount of threads in production workloads, what is the issue in setting it so it starts with enough threads?

Nb: let's please try to keep this discussion productive. I too would like to stress this is a pressing issue for us, but let's not make this a toxic discussion

In my testing with 50 concurrent connections, I did not notice any specific performance issues. This is based on observation rather than benchmark testing. It worked pretty well when I increased the timeout to 60 seconds. This provided enough time for the socket to connect. I think that would be the optimal workaround for now. However, as I mentioned before, give me some time to test my findings, and I will get back to you.

BlythMeister commented 3 months ago

Is there scope to create a 5.1.6 which is 5.1.5 with the dependencies on Azure.Identity and Microsoft.Identity.Client updated to patches with CVE-2024-35255 fixes applied. I fully understand find a "proper" solution may take time, but forcing users connecting to SQL server and running on linux into unproven workarounds on PRD systems to be able to patch against such a serious vulnerability in another Microsoft library is completely unacceptable.

A patch of 5.1.5 which updates the transient dependencies with vulnerabilities is (in my opinion) essential. It is now nearly 2 months since these transient dependencies were updated to fix the CVE, but it's impossible to get these updates.

PaulVrugt commented 3 months ago

So I'm trying to find the most suitable workaround for our application that works properly in production. am I to understand that if I increase the connect timeout parameter the issue does not occur? Will it then still take a lot of time to connect each time. Or does it only take a lot of time until enough threads are created to handle all connections, and will it then be fast until the application is restarted? I also don't really understand the resistance of setting min threads. If apparently your application needs that amount of threads in production workloads, what is the issue in setting it so it starts with enough threads? Nb: let's please try to keep this discussion productive. I too would like to stress this is a pressing issue for us, but let's not make this a toxic discussion

In my testing with 50 concurrent connections, I did not notice any specific performance issues. This is based on observation rather than benchmark testing. It worked pretty well when I increased the timeout to 60 seconds. This provided enough time for the socket to connect. I think that would be the optimal workaround for now. However, as I mentioned before, give me some time to test my findings, and I will get back to you.

Any update on this @JRahnama ?

JRahnama commented 3 months ago

@PaulVrugt Unfortunately, my testing didn't yield positive results. I've explored various solutions within the current design, but none were effective and each had its own drawback.

In the meantime, here are some workarounds I've found:

  1. Increase ConnectTimeout: This can help by providing more time for socket connection.
  2. Set Minimum Available Threads: Adjusting the minimum available threads in your application might help, although it could impact performance.
  3. Disable Pooling: Turning off pooling creates a new connection each time, which is not ideal as it’s quite resource-intensive.

I understand that the issue is frustrating, but I want to assure you that we're doing our best to resolve it as quickly as possible. Our team is actively working on redesigning the pool and concurrent connections, and it's progressing well.

BlythMeister commented 3 months ago

@PaulVrugt Unfortunately, my testing didn't yield positive results. I've explored various solutions within the current design, but none were effective and each had its own drawback.

In the meantime, here are some workarounds I've found:

  1. Increase ConnectTimeout: This can help by providing more time for socket connection.
  2. Set Minimum Available Threads: Adjusting the minimum available threads in your application might help, although it could impact performance.
  3. Disable Pooling: Turning off pooling creates a new connection each time, which is not ideal as it’s quite resource-intensive.

I understand that the issue is frustrating, but I want to assure you that we're doing our best to resolve it as quickly as possible. Our team is actively working on redesigning the pool and concurrent connections, and it's progressing well.

@JRahnama Whilst i fully appreciate your working on a proper fix for this issue. Is there really no scope to release a hotfix to 5.1.5 with the dependencies updated as per my comment - https://github.com/dotnet/SqlClient/issues/2378#issuecomment-2247689996 ??

ErikEJ commented 3 months ago

@BlythMeister Yes, 5.1.6 is being worked on

cbi-at-varian commented 3 months ago

The provided workarounds aren't mitigating the issue completely in our case. The only "solution" is actually to downgrade to 5.1.5

We're load testing our service with k6 using the ramping-vus executor. We're ramping it up to 1000 vus and let the test run for 3min. For one of our scenario we tested using different connection timeouts with the 5.2.1 version as suggested workaround 1)

setup details average request time minimum request time median request time max request time failed calls out of max
SqlClient.5.2.1; default Connection Timeout (15) 356.52ms 1.18ms 346.64ms 15.51s failed 1.41% (1605/112097)
SqlClient.5.2.1; Connection Timeout=30 361.04ms 1.21ms 344.56ms 30.66s failed 1.17% (1418/119465)
SqlClient.5.2.1; Connection Timeout=60 358.13ms 1.19ms 347.49ms 1m0s failed 0.27% (316/116617)
SqlClient.5.1.5; default Connection Timeout (15) 384.03m 1.08ms 367.17ms 1.45s failed 0% (0/119816 )

The higher we put the timeout the better it went. Going to 180s we got zero errors, but we started to get client timeouts. Which is not acceptable as well.

The suggested workaround 2) to increase the Threadpool minimum size did not provide any better results. Honestly, we didn't try out the suggested workaround 3) as it sounds just too scary..

So please a proper fix would be really appreciated. Is there in the meantime an ETA for the fix?

ahouben commented 3 months ago

@BlythMeister Yes, 5.1.6 is being worked on

@ErikEJ @JRahnama would you know more about an ETA for when the connection pooling/handling is fixed and released in a 5.1.6 (hotfix) or 5.2.2/3 (on top of HEAD) version ?

From https://www.nuget.org/packages/Microsoft.Data.SqlClient... image

robs commented 3 months ago

So, have I read this wrong or does the latest stable release (5.2.1) in nuget have this issue?

ahouben commented 3 months ago

@robs The latest version 5.2.1 on nuget still has this issue, it has not been fixed yet.

robs commented 3 months ago

Oh wow, surely the "stable" flag should be removed then?

We've not updated to 5.2.x yet but we're running Linux clients to Windows MS SQL.

Is everyone just waiting in the hope there's a 5.1.x release that has the updated dependencies?

JonasJes commented 3 months ago

@cbi-at-varian Great work showing what we are experiencing

JRahnama commented 3 months ago

Is everyone just waiting in the hope there's a 5.1.x release that has the updated dependencies?

5.1.x will be released soon ( in a week or so) with updated dependencies.

Markeli commented 3 months ago

And what about 5.2.x? Will it also be released soon?

JRahnama commented 3 months ago

Could someone please test the package below to see if the issue is resolved? Simply change the extension to nupkg. Note that this is an unsigned package and is intended for testing purposes only. Microsoft.Data.SqlClient.6.0.0-pull.122701.zip

Update: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.

JRahnama commented 3 months ago

And what about 5.2.x? Will it also be released soon?

If the fix works it will be backported to 5.2 and the hot fix will be released in second half of August.

JRahnama commented 3 months ago

The issue was at TCPHanlde.Connect function. Having those async tasks in the middle of the primary sync flow was causing thread pool starvation under load. The proposed solution was to remove that as we are using socket.Select() later with the timeout to honor the timeout. Which was also mentioned at here by @roji

cbi-at-varian commented 3 months ago

Could someone please test the package below to see if the issue is resolved? Simply change the extension to nupkg. Note that this is an unsigned package and is intended for testing purposes only. Microsoft.Data.SqlClient.6.0.0-pull.122701.zip

Update: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.

I integrated the package provided by @JRahnama and run the load tests. It seems to fix the issue, thanks for that!

Here the numbers to compare against: setup details average request time minimum request time median request time max request time failed calls out of max
SqlClient.5.1.5; default Connection Timeout (15) 398.72m 1.19ms 376.23ms 1.49s failed 0% (0/226103)
SqlClient.6.0.0.pull; default Connection Timeout (15) 434.98ms 1.3ms 401.33ms 1.51s failed 0% (0/207359)

Please note the numbers do not imply a certain quality of the fix, the load test runs aren't deterministic. Important for me is that we keep the expected thresholds and that there are no failures. It feels the fix is slightly slower, but only a benchmark would reveal that.

saurabh500 commented 3 months ago

@cbi-at-varian what does the average request time column measure? The average time taken to open a connection?

JRahnama commented 3 months ago

Here are benchmarks after the fix:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3) 13th Gen Intel Core i7-1365U, 1 CPU, 12 logical and 10 physical cores .NET SDK 8.0.400 [Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2 .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

Job=.NET 8.0 Runtime=.NET 8.0

Method Mean Error StdDev Ratio RatioSD
OpenSqlConnectionAsync 172.3 ms 3.05 ms 2.55 ms 1.00 0.02

below are results from 5.1.5::

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3) 13th Gen Intel Core i7-1365U, 1 CPU, 12 logical and 10 physical cores .NET SDK 8.0.400 [Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2 .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

Job=.NET 8.0 Runtime=.NET 8.0

Method Mean Error StdDev Ratio RatioSD
OpenSqlConnectionAsync 177.0 ms 3.52 ms 4.58 ms 1.00 0.04

My sample app was the repro provided from this issue with ConnectTimeout set to 10 seconds.

JRahnama commented 3 months ago

There are a couple of PRs to be bacported to 5.2 and the we will proceed with the release of hotfix 5.2.2

cbi-at-varian commented 3 months ago

@cbi-at-varian what does the average request time column measure? The average time taken to open a connection?

@saurabh500 the measurements are on the client side of the load tests. Those numbers therefore include the web stack, some small business logic including accessing the database.

patrickg-hchb commented 2 months ago

Could someone please test the package below to see if the issue is resolved? Simply change the extension to nupkg. Note that this is an unsigned package and is intended for testing purposes only. Microsoft.Data.SqlClient.6.0.0-pull.122701.zip Update: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.

I integrated the package provided by @JRahnama and run the load tests. It seems to fix the issue, thanks for that!

Here the numbers to compare against:

setup details average request time minimum request time median request time max request time failed calls out of max SqlClient.5.1.5; default Connection Timeout (15) 398.72m 1.19ms 376.23ms 1.49s failed 0% (0/226103) SqlClient.6.0.0.pull; default Connection Timeout (15) 434.98ms 1.3ms 401.33ms 1.51s failed 0% (0/207359) Please note the numbers do not imply a certain quality of the fix, the load test runs aren't deterministic. Important for me is that we keep the expected thresholds and that there are no failures. It feels the fix is slightly slower, but only a benchmark would reveal that.

We are still seeing the same issue on our side with this version. open MS ticket 2406250040008194 Is there someone here from MS that can help with this issue?

bdschaap commented 2 months ago

Could someone please test the package below to see if the issue is resolved? Simply change the extension to nupkg. Note that this is an unsigned package and is intended for testing purposes only. Microsoft.Data.SqlClient.6.0.0-pull.122701.zip

Update: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.

Can a preview nuget be pushed?

JRahnama commented 2 months ago

Can a preview nuget be pushed?

M.D.SqlClient v6.0.0-preview1 and the v5.2.2 hotfix release are set to be released within the next day or two.

JRahnama commented 2 months ago

Hotfix v5.2.2 is released.