Closed alex-jitbit closed 3 months ago
@JRahnama hmmm okay okay.... maybe switch to Chat GPT instead of Bing AI 😉
Here's what it suggested when I prompted "An annoyed customer responded with this. Please help me craft a short responses."
Thank you for bringing this to our attention. I apologize for the frustration and inconvenience you’ve experienced. Your feedback is invaluable, and I want to assure you that we take these issues very seriously.
We understand the significant investment you’ve made in SQL Server, and it’s clear that the recent changes have negatively impacted your operations. This is not the level of service we aim to provide, and we are committed to resolving this issue promptly.
Our team is currently investigating the root cause of the problem and will work diligently to find a solution. We will keep you updated with our progress and ensure that any changes are thoroughly tested moving forward.
Your patience and understanding are greatly appreciated. If you have any further concerns or require immediate assistance, please do not hesitate to reach out.
You see, a high-quality GPT response like that gives you the impression that you aren't being passive-aggressive. And it almost gives the impression that you care about the problems you have caused for your customers.
So I'm trying to find the most suitable workaround for our application that works properly in production.
am I to understand that if I increase the connect timeout parameter the issue does not occur? Will it then still take a lot of time to connect each time. Or does it only take a lot of time until enough threads are created to handle all connections, and will it then be fast until the application is restarted?
I also don't really understand the resistance of setting min threads. If apparently your application needs that amount of threads in production workloads, what is the issue in setting it so it starts with enough threads?
Nb: let's please try to keep this discussion productive. I too would like to stress this is a pressing issue for us, but let's not make this a toxic discussion
@alex-jitbit if increasing the ConnectTimeout is not a preferred solution, I may have a solution for this issue until SqlClientX is out for good. Give me some time to test my proposed solution and discuss that internally with other team members, in case of positive outcome I will give you another nuget to test with.
So I'm trying to find the most suitable workaround for our application that works properly in production.
am I to understand that if I increase the connect timeout parameter the issue does not occur? Will it then still take a lot of time to connect each time. Or does it only take a lot of time until enough threads are created to handle all connections, and will it then be fast until the application is restarted?
I also don't really understand the resistance of setting min threads. If apparently your application needs that amount of threads in production workloads, what is the issue in setting it so it starts with enough threads?
Nb: let's please try to keep this discussion productive. I too would like to stress this is a pressing issue for us, but let's not make this a toxic discussion
In my testing with 50 concurrent connections, I did not notice any specific performance issues. This is based on observation rather than benchmark testing. It worked pretty well when I increased the timeout to 60 seconds. This provided enough time for the socket to connect. I think that would be the optimal workaround for now. However, as I mentioned before, give me some time to test my findings, and I will get back to you.
Is there scope to create a 5.1.6 which is 5.1.5 with the dependencies on Azure.Identity and Microsoft.Identity.Client updated to patches with CVE-2024-35255 fixes applied. I fully understand find a "proper" solution may take time, but forcing users connecting to SQL server and running on linux into unproven workarounds on PRD systems to be able to patch against such a serious vulnerability in another Microsoft library is completely unacceptable.
A patch of 5.1.5 which updates the transient dependencies with vulnerabilities is (in my opinion) essential. It is now nearly 2 months since these transient dependencies were updated to fix the CVE, but it's impossible to get these updates.
So I'm trying to find the most suitable workaround for our application that works properly in production. am I to understand that if I increase the connect timeout parameter the issue does not occur? Will it then still take a lot of time to connect each time. Or does it only take a lot of time until enough threads are created to handle all connections, and will it then be fast until the application is restarted? I also don't really understand the resistance of setting min threads. If apparently your application needs that amount of threads in production workloads, what is the issue in setting it so it starts with enough threads? Nb: let's please try to keep this discussion productive. I too would like to stress this is a pressing issue for us, but let's not make this a toxic discussion
In my testing with 50 concurrent connections, I did not notice any specific performance issues. This is based on observation rather than benchmark testing. It worked pretty well when I increased the timeout to 60 seconds. This provided enough time for the socket to connect. I think that would be the optimal workaround for now. However, as I mentioned before, give me some time to test my findings, and I will get back to you.
Any update on this @JRahnama ?
@PaulVrugt Unfortunately, my testing didn't yield positive results. I've explored various solutions within the current design, but none were effective and each had its own drawback.
In the meantime, here are some workarounds I've found:
I understand that the issue is frustrating, but I want to assure you that we're doing our best to resolve it as quickly as possible. Our team is actively working on redesigning the pool and concurrent connections, and it's progressing well.
@PaulVrugt Unfortunately, my testing didn't yield positive results. I've explored various solutions within the current design, but none were effective and each had its own drawback.
In the meantime, here are some workarounds I've found:
- Increase ConnectTimeout: This can help by providing more time for socket connection.
- Set Minimum Available Threads: Adjusting the minimum available threads in your application might help, although it could impact performance.
- Disable Pooling: Turning off pooling creates a new connection each time, which is not ideal as it’s quite resource-intensive.
I understand that the issue is frustrating, but I want to assure you that we're doing our best to resolve it as quickly as possible. Our team is actively working on redesigning the pool and concurrent connections, and it's progressing well.
@JRahnama Whilst i fully appreciate your working on a proper fix for this issue. Is there really no scope to release a hotfix to 5.1.5 with the dependencies updated as per my comment - https://github.com/dotnet/SqlClient/issues/2378#issuecomment-2247689996 ??
@BlythMeister Yes, 5.1.6 is being worked on
The provided workarounds aren't mitigating the issue completely in our case. The only "solution" is actually to downgrade to 5.1.5
We're load testing our service with k6 using the ramping-vus executor. We're ramping it up to 1000 vus and let the test run for 3min. For one of our scenario we tested using different connection timeouts with the 5.2.1 version as suggested workaround 1)
setup details | average request time | minimum request time | median request time | max request time | failed calls out of max |
---|---|---|---|---|---|
SqlClient.5.2.1; default Connection Timeout (15) | 356.52ms | 1.18ms | 346.64ms | 15.51s | failed 1.41% (1605/112097) |
SqlClient.5.2.1; Connection Timeout=30 | 361.04ms | 1.21ms | 344.56ms | 30.66s | failed 1.17% (1418/119465) |
SqlClient.5.2.1; Connection Timeout=60 | 358.13ms | 1.19ms | 347.49ms | 1m0s | failed 0.27% (316/116617) |
SqlClient.5.1.5; default Connection Timeout (15) | 384.03m | 1.08ms | 367.17ms | 1.45s | failed 0% (0/119816 ) |
The higher we put the timeout the better it went. Going to 180s we got zero errors, but we started to get client timeouts. Which is not acceptable as well.
The suggested workaround 2) to increase the Threadpool minimum size did not provide any better results. Honestly, we didn't try out the suggested workaround 3) as it sounds just too scary..
So please a proper fix would be really appreciated. Is there in the meantime an ETA for the fix?
@BlythMeister Yes, 5.1.6 is being worked on
@ErikEJ @JRahnama would you know more about an ETA for when the connection pooling/handling is fixed and released in a 5.1.6 (hotfix) or 5.2.2/3 (on top of HEAD) version ?
From https://www.nuget.org/packages/Microsoft.Data.SqlClient...
So, have I read this wrong or does the latest stable release (5.2.1) in nuget have this issue?
@robs The latest version 5.2.1 on nuget still has this issue, it has not been fixed yet.
Oh wow, surely the "stable" flag should be removed then?
We've not updated to 5.2.x yet but we're running Linux clients to Windows MS SQL.
Is everyone just waiting in the hope there's a 5.1.x release that has the updated dependencies?
@cbi-at-varian Great work showing what we are experiencing
Is everyone just waiting in the hope there's a 5.1.x release that has the updated dependencies?
5.1.x will be released soon ( in a week or so) with updated dependencies.
And what about 5.2.x? Will it also be released soon?
Could someone please test the package below to see if the issue is resolved? Simply change the extension to nupkg
. Note that this is an unsigned package and is intended for testing purposes only.
Microsoft.Data.SqlClient.6.0.0-pull.122701.zip
Update: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.
And what about 5.2.x? Will it also be released soon?
If the fix works it will be backported to 5.2 and the hot fix will be released in second half of August.
The issue was at TCPHanlde.Connect function. Having those async tasks in the middle of the primary sync flow was causing thread pool starvation under load. The proposed solution was to remove that as we are using socket.Select() later with the timeout to honor the timeout. Which was also mentioned at here by @roji
Could someone please test the package below to see if the issue is resolved? Simply change the extension to
nupkg
. Note that this is an unsigned package and is intended for testing purposes only. Microsoft.Data.SqlClient.6.0.0-pull.122701.zipUpdate: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.
I integrated the package provided by @JRahnama and run the load tests. It seems to fix the issue, thanks for that!
Here the numbers to compare against: | setup details | average request time | minimum request time | median request time | max request time | failed calls out of max |
---|---|---|---|---|---|---|
SqlClient.5.1.5; default Connection Timeout (15) | 398.72m | 1.19ms | 376.23ms | 1.49s | failed 0% (0/226103) | |
SqlClient.6.0.0.pull; default Connection Timeout (15) | 434.98ms | 1.3ms | 401.33ms | 1.51s | failed 0% (0/207359) |
Please note the numbers do not imply a certain quality of the fix, the load test runs aren't deterministic. Important for me is that we keep the expected thresholds and that there are no failures. It feels the fix is slightly slower, but only a benchmark would reveal that.
@cbi-at-varian what does the average request time
column measure? The average time taken to open a connection?
Here are benchmarks after the fix:
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3) 13th Gen Intel Core i7-1365U, 1 CPU, 12 logical and 10 physical cores .NET SDK 8.0.400 [Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2 .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
Job=.NET 8.0 Runtime=.NET 8.0
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
OpenSqlConnectionAsync | 172.3 ms | 3.05 ms | 2.55 ms | 1.00 | 0.02 |
below are results from 5.1.5::
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3) 13th Gen Intel Core i7-1365U, 1 CPU, 12 logical and 10 physical cores .NET SDK 8.0.400 [Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2 .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
Job=.NET 8.0 Runtime=.NET 8.0
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
OpenSqlConnectionAsync | 177.0 ms | 3.52 ms | 4.58 ms | 1.00 | 0.04 |
My sample app was the repro provided from this issue with ConnectTimeout set to 10 seconds.
There are a couple of PRs to be bacported to 5.2 and the we will proceed with the release of hotfix 5.2.2
@cbi-at-varian what does the
average request time
column measure? The average time taken to open a connection?
@saurabh500 the measurements are on the client side of the load tests. Those numbers therefore include the web stack, some small business logic including accessing the database.
Could someone please test the package below to see if the issue is resolved? Simply change the extension to
nupkg
. Note that this is an unsigned package and is intended for testing purposes only. Microsoft.Data.SqlClient.6.0.0-pull.122701.zip Update: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.I integrated the package provided by @JRahnama and run the load tests. It seems to fix the issue, thanks for that!
Here the numbers to compare against:
setup details average request time minimum request time median request time max request time failed calls out of max SqlClient.5.1.5; default Connection Timeout (15) 398.72m 1.19ms 376.23ms 1.49s failed 0% (0/226103) SqlClient.6.0.0.pull; default Connection Timeout (15) 434.98ms 1.3ms 401.33ms 1.51s failed 0% (0/207359) Please note the numbers do not imply a certain quality of the fix, the load test runs aren't deterministic. Important for me is that we keep the expected thresholds and that there are no failures. It feels the fix is slightly slower, but only a benchmark would reveal that.
We are still seeing the same issue on our side with this version. open MS ticket 2406250040008194 Is there someone here from MS that can help with this issue?
Could someone please test the package below to see if the issue is resolved? Simply change the extension to
nupkg
. Note that this is an unsigned package and is intended for testing purposes only. Microsoft.Data.SqlClient.6.0.0-pull.122701.zipUpdate: Our test pipelines passed with the changes, and the provided repro is working well with this set of modifications. The fix appears promising. I’ll wait for the impacted users' test results before proceeding with the PR. Also I will look into adding some tests if possible to prevent this from happening in future. Thank you all for your patience. Also thanks to @David-Engel for pointing out the root cause and solution.
Can a preview nuget be pushed?
Can a preview nuget be pushed?
M.D.SqlClient v6.0.0-preview1 and the v5.2.2 hotfix release are set to be released within the next day or two.
After upgrading 5.1.5 to 5.2.0 on Linux (Ubuntu) .NET 8 I'm getting thousands of errors:
Some requests work fine, but about 50% throw this error. Reverting back to 5.1.5 solves the problem.
Further technical details
Microsoft.Data.SqlClient version: 5.2 .NET target: NET 8 SQL Server version: SQL 2017 on Linux Operating system: Ubuntu 22