dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.

Intermittent Unknown error 258 with no obvious cause #1530

Open deadwards90 opened 2 years ago

deadwards90 commented 2 years ago

Describe the bug

On occasion we will see the following error:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)

However, SQL Server shows no long-running queries and is not using much of its resources during the periods when this happens.

It looks to be more of an intermittent connection issue but we're unable to find any sort of root cause.

To reproduce

We're not sure of the reproduction steps. I've been unable to reproduce this myself by simulating load. From what we can tell, it is more likely to happen when the pod is busy (not just with HTTP, but also handling events from an external source), but equally it can happen when nothing much is happening on the pod, which has caused us a substantial amount of confusion.

Expected behavior

Either more information on what the cause might be, or some solution to the issue. I realise the driver might not actually know the cause and, from its point of view, it may really be a timeout. We're not entirely sure where the problem lies yet, which is the biggest issue.

Further technical details

Microsoft.Data.SqlClient version: 3.0.1
.NET target: Core 3.1
SQL Server version: Microsoft SQL Azure (RTM) - 12.0.2000.8
Operating system: Docker Container - mcr.microsoft.com/dotnet/aspnet:3.1

Additional context

JRahnama commented 2 years ago

As we have seen before in issue #647, the underlying reason may come from a number of different problems which may not be the driver's fault at all. Any interruption in connectivity, a missed port, or a socket failure could lead to this issue. We cannot say more without more context about the application. The most helpful step would be a minimal repro, which is usually impossible to create. However, we can capture EventSource traces and see what has gone wrong. I would suggest closing this issue and following #647.

deadwards90 commented 2 years ago

@JRahnama happy to close the issue and add my information to that ticket, but I want to be absolutely sure that the ticket referenced (which I've also included in the original report) is not just for the ReadSniSyncOverAsync errors, which are not in our stack trace.

In the meantime, we'll hook up the EventSource you linked to gather some more information. Did not realise that was available!


EDIT: Any suggestions on which event source traces to enable?

JRahnama commented 2 years ago

@dantheman999301 if you use PerfView it will capture all events. Make sure you filter them by the Microsoft.Data.SqlClient.EventSource name.

deadwards90 commented 2 years ago

@JRahnama unfortunately, due to the nature of this issue (it only shows up in our production environment, which is Linux Docker in AKS), I don't think PerfView is going to work for us, as nice as it would be.

We've managed to wrangle the EventListener so that it will only log on errors. These errors usually show up at least once in the morning for us in a certain service during weekdays so fingers crossed on Monday I should have something for you.
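For anyone wanting to do something similar, a minimal sketch of that kind of in-process listener is below (the filtering and logging are illustrative, not our exact code):

```csharp
using System;
using System.Diagnostics.Tracing;

// Minimal sketch: subscribe to the SqlClient EventSource and surface only
// error-level events. Error level keeps the volume down; switch to
// Informational/Verbose (and narrower keywords) while actively investigating.
public sealed class SqlClientErrorListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            EnableEvents(eventSource, EventLevel.Error, EventKeywords.All);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Forward to whatever logging pipeline you use (ILogger, Serilog, ...).
        var payload = eventData.Payload == null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventName}: {payload}");
    }
}
```

Keeping a single instance of it alive for the lifetime of the app is enough to register it.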

JRahnama commented 2 years ago

I can happily point you to PerfCollect for Unix machines, but I'm not sure how it works with Docker. It captures all events, and you can transfer the files to a Windows machine and investigate them there.

deadwards90 commented 2 years ago

@JRahnama good news is, we managed to get the logs (we think).

Bad news is, there are 200,000 of them. Unfortunately it didn't filter like I thought it would, and as it was happening during a busy period and we had all Keywords on, there was a lot to collect.

We need to work out how we're going to export them from Kibana, and obviously it's not as good as a full dump. PerfCollect might work, but we'll need to work out how to hook it up and run it over what is usually a period of an hour without impacting our production systems. We might also be able to use dotnet-monitor, but that would require some investigation too.

Let me know if it's of any use.

vincentDAO commented 2 years ago

We hit the same issue 5 days ago, even though our code was working and hadn't changed in the days before.

JRahnama commented 2 years ago

@dantheman999301 sorry for the late response, we got busy with the preview release. Any kind of log that shows, or helps us understand, where it happens would be helpful.

JRahnama commented 2 years ago

We hit the same issue 5 days ago, even though our code was working and hadn't changed in the days before.

The same questions apply to your case as well. Any repro or tracing logs would be helpful. I would also suggest investigating network traces; that could clarify some of the underlying issues.

oyvost commented 2 years ago

An error 258 timeout is a common exception when the DTU limit is reached on Azure SQL. If you're on Azure, you can try monitoring the Max DTU percentage and see if it hits the limit.
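If you want to rule that out from the database side, a rough sketch of sampling the last hour of resource stats (the connection string is a placeholder):

```csharp
using System;
using Microsoft.Data.SqlClient;

// Sketch: check recent resource usage on an Azure SQL database to rule out
// DTU/CPU throttling. sys.dm_db_resource_stats keeps roughly an hour of
// history at 15-second granularity.
var connectionString = Environment.GetEnvironmentVariable("SQL_CONNECTION_STRING"); // placeholder

const string query = @"
SELECT TOP (20) end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;";

using var connection = new SqlConnection(connectionString);
using var command = new SqlCommand(query, connection);
connection.Open();
using var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine($"{reader[0]:u} cpu={reader[1]}% data-io={reader[2]}% log-write={reader[3]}%");
}
```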

deadwards90 commented 2 years ago

@oyvost in the issue we are facing, we can see that the connection is not even made to SQL Server, so it's not a throttling issue or similar.

For example, we had an error about 20 minutes ago and it was utilising 4% of the available DTUs.

@JRahnama just to get back to you, the logs we thought we had turned out not to be any good; there was a lot of duplication due to the way we wrote the EventListener. The closest we've got to any answer is that we think it's timing out trying to get a connection from the connection pool, even though, from what we could tell from the event counters, there seemed to be plenty of available connections in the pool. We're not overly confident this is the cause, but it's something to go off.

We did see this error once in one of the services that is prone to showing the other error I posted.

Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - Unknown error 16974573)
 ---> System.ComponentModel.Win32Exception (16974573): Unknown error 16974573
   at Microsoft.Data.ProviderBase.DbConnectionPool.CheckPoolBlockingPeriod(Exception e)
   at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
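For what it's worth, the event counters we looked at come from the same EventSource. A rough sketch of surfacing them in-process (counter names vary between driver versions, so this just dumps whatever arrives):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Sketch: dump the SqlClient EventCounters (connection pool gauges among them)
// every 10 seconds, so pool behaviour can be correlated with the timeouts.
public sealed class SqlClientCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            EnableEvents(eventSource, EventLevel.Informational, EventKeywords.None,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "10" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload is null)
            return;

        // Each payload item is a dictionary such as { Name = ..., Mean/Increment = ..., Count = ... }.
        foreach (var item in eventData.Payload)
        {
            if (item is IDictionary<string, object> counter)
                Console.WriteLine(string.Join(", ", counter));
        }
    }
}
```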

DLS201 commented 2 years ago

Hello, we have been facing a similar issue for approximately a month, with SQL queries failing with Win32Exceptions with code 258 and no obvious cause. The DB itself signals no issue, with very low DTU usage.

JRahnama commented 2 years ago

@DLS201 can you post the stack trace of the exception? Have you checked the TCP and socket events/logs?

DLS201 commented 2 years ago

Hi, here is our error message:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): No error information
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__207_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass33_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

The database is an Azure SQL instance; no errors on that side.

mekk1t commented 2 years ago

@DLS201 Hello! Any progress on this? I'm encountering the same issue, but on three different occasions. I've documented them as a question on Stack Overflow.

deadwards90 commented 2 years ago

@mekk1t out of interest, what infrastructure have you got?

We've only ever been able to narrow it down (we think) to a timeout waiting for a connection from the connection pool. That said, upping the connection pool size and timeout seems to make no difference.
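For reference, these are the standard connection-string knobs involved, sketched here with illustrative values (baseConnectionString stands in for the real connection string):

```csharp
using System;
using Microsoft.Data.SqlClient;

// Sketch: the pool-related connection string settings. In our case changing
// these made no observable difference, but they are the obvious dials to try.
var baseConnectionString = Environment.GetEnvironmentVariable("SQL_CONNECTION_STRING"); // placeholder

var builder = new SqlConnectionStringBuilder(baseConnectionString)
{
    MaxPoolSize = 200,   // default is 100
    MinPoolSize = 10,    // keep some connections warm
    ConnectTimeout = 30  // seconds to wait for a pooled or new connection
};

using var connection = new SqlConnection(builder.ConnectionString);
connection.Open();
```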

DLS201 commented 2 years ago

Hello, Azure Support just told us that the product group identified the issue and will fix it Q4 2022.

Regards,

johngwood commented 2 years ago

Azure Support just told us that the product group identified the issue and will fix it Q4 2022.

@DLS201 any chance you have more information on this? Maybe a link to a bug or something? We are investigating this issue currently and would like to know what MSFT knows...

robsaa commented 2 years ago

I am having the exact same issue with this setup:

Microsoft.Data.SqlClient version: 4.1.0
.NET target: .NET 6.0
SQL Server version: Microsoft SQL Server Standard (64-bit) 15.0.2070.41 (on-premise)
Operating system: Docker Container running on Linux - mcr.microsoft.com/dotnet/aspnet:6.0

dbeavon commented 2 years ago

@DLS201 do you have a scope? Does it happen for any .NET client code running anywhere in Azure?

In my case I'm having tons of problems getting a .NET solution to run correctly in Spark for Synapse.

I'm opening a support case as well. But you could save us all many, many days of time with CSS if you would point us in the right direction. Generally CSS seems to not be highly technical, and they seem just as far removed from the Azure product groups as I am!

Interestingly, I can run my Spark cluster locally, outside of Azure, and I have no issues connecting to the same SQL database over ExpressRoute. It is only when my .NET code also runs in Azure that I have trouble connecting to Azure SQL. This seems counter-intuitive. (Another odd data point: the exact same code of mine works fine when running on Azure Databricks VMs, which means it may be a problem related specifically to how Synapse interacts with Azure private resources.)

I suspect the problem is with the newer/proprietary layers that Microsoft has been putting on top of their VNets these days (i.e. "managed VNets" and "managed private endpoints"). I have noticed other Azure services are extremely flaky as well (e.g. Power BI's managed-VNet gateway and ADF pipelines are pretty unpredictable; I have opened support tickets for both of these as well).

MichelZ commented 1 year ago

@DLS201 Please elaborate! We're seeing this issue from time to time too, in Azure AKS running against Azure SQL Elastic Pool DBs.

cheenamalhotra commented 1 year ago

We would appreciate more details about the driver namespace + version + target framework in use, to help us identify, track, and unblock you on known issues.

MichelZ commented 1 year ago

You mean like:

MichelZ commented 1 year ago

OK, so going further down the troubleshooting rabbit hole and reading up on some things (especially in connection with AKS), it seems our issue was actually SNAT port exhaustion on the outbound load balancer for the AKS cluster.

Diagnose port exhaustion: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-standard-diagnostics#how-do-i-check-my-snat-port-usage-and-allocation

https://docs.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection

What it looks like: [screenshot of SNAT port usage and allocation on the load balancer]

We have increased the ports per backend instance (to a ridiculously high number :) ), so we hope to never see this again.

https://docs.microsoft.com/en-us/azure/aks/load-balancer-standard#configure-the-allocated-outbound-ports

MichelZ commented 1 year ago

Actually, it happened again, and this time without hitting the SNAT limit :(

cheenamalhotra commented 1 year ago

Since you're able to reproduce the issue in AKS, does it also reproduce in Docker containers or a Unix environment locally? Would it be possible to share a minimal repro here to investigate?

robsaa commented 1 year ago

Came across this website while searching for a solution:

https://support.optimizely.com/hc/en-us/articles/4432366206733-CMS-12-site-crash-due-to-SQL-timeout-error-when-working-in-CMS-edit-mode

Could it be related to ThreadPool.MaxThreads?

I have not tried to adjust MaxThreads myself yet.

hendxxx commented 1 year ago

<ConcurrentGarbageCollection>false</ConcurrentGarbageCollection>
<ThreadPoolMinThreads>4</ThreadPoolMinThreads>
<ThreadPoolMaxThreads>200</ThreadPoolMaxThreads>
<ServerGarbageCollection>false</ServerGarbageCollection>

Same here, this is my config... it still happens :( Why???

maxafu commented 1 year ago

I'm having the same issue. Any solutions?

deadwards90 commented 1 year ago

@maxafu we never got anywhere with it but I'm no longer with the company this was reported on. I believe people there are still watching this issue though.

We were hoping that what @DLS201 told us was true and MS were going to fix it this quarter. Every other avenue of investigation failed basically.

lcheunglci commented 1 year ago

<ConcurrentGarbageCollection>false</ConcurrentGarbageCollection>
<ThreadPoolMinThreads>4</ThreadPoolMinThreads>
<ThreadPoolMaxThreads>200</ThreadPoolMaxThreads>
<ServerGarbageCollection>false</ServerGarbageCollection>

Same here, this is my config... it still happens :( Why???

I posted a reply in #647; the TL;DR is that it might be related to thread pool starvation, and the mitigation is to increase ThreadPoolMinThreads as a temporary solution. Unfortunately, we currently don't have a fix for it yet.
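A minimal sketch of that mitigation, applied at startup (the numbers are illustrative and workload-specific):

```csharp
using System;
using System.Threading;

// Sketch: raise the thread pool minimums so async completions aren't starved
// while the pool slowly injects new threads under bursty load.
ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads);
ThreadPool.SetMinThreads(Math.Max(workerThreads, 20), Math.Max(completionPortThreads, 20));
```

Setting the ThreadPoolMinThreads project property, as in the config quoted above, should have the same effect.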

dazinator commented 1 year ago

Turns out our issue was simply a genuinely long-running query (dynamic, based on filters the user selects on the front end) hitting the timeout with some filter selections. We had assumed (that magic word) the particular query involved (which happens sporadically) was OK because similar (but not identical) queries were OK. When we intercepted the T-SQL of a query that reached the timeout and re-ran it exactly as-is, we replicated the timeout and perf issue. Slightly embarrassing, but there you go - make sure you log the exception, try to intercept the exact T-SQL, and re-run it to make sure it's not just a genuine query perf issue ;-)

deadwards90 commented 1 year ago

Just so the waters are not muddied, I can safely say that this was not the cause of the issue we saw. DTUs would not budge, which you'd expect them to if a query was underperforming, and the queries themselves were incredibly simple and running fine on either side.

SQL Server also did not report seeing the timeouts which you'd expect for a genuine timeout.

dazinator commented 1 year ago

@dantheman999301 Forgive me for doing this, I just wanted to play devil's advocate in the small chance it might help (very small) you, or someone else.

DTUs would not budge, which you'd expect if a query was underperforming

In our case, our DTUs were well within limits. The only requirement for this to occur on our side was for a query to run past the timeout set by the client. There is no specific requirement that the query be CPU, IO or memory intensive - although I agree you would assume the two go hand in hand.

and the queries themselves were incredibly simple and running fine on either side

This suggests you were looking at multiple queries. Could it be that the specific query that caused the timeout has a slightly different query plan to the rest? For example, this is commonly the case with queries that have IN clauses, where the IN clause portion changes with the values provided from query to query, so SQL Server has to regenerate query plans. The queries themselves may seem simple and alike, but if the query plan differs from invocation to invocation, some specific queries could be drastically different in performance whilst appearing similar in nature.

SQL Server also did not report seeing the timeouts which you'd expect for a genuine timeout.

Does SQL Server report timeouts that are due to the client's own timeout settings? I honestly don't know, but it would surprise me if it did. When connecting via EF Core (ADO.NET) we set a timeout on the client of 30 seconds. Most queries ran in < 1s. Occasionally a very "similar" query would take > 30s and we'd get this timeout. This led to the illusion that it couldn't possibly be the query itself.

Basically, what I'd say is: be sure you've captured the specific query that caused the timeout, and then re-execute it exactly as-is to make sure it definitely isn't a perf issue with that query.
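A rough sketch of what that capture can look like (error number -2 is the ADO.NET execution-timeout SqlException; for parameterized queries you would also want to log the parameter values):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Sketch: log the exact command text when an execution timeout occurs, so the
// statement can be re-run as-is against the server.
public static class TimeoutCapture
{
    public static async Task<int> ExecuteWithCaptureAsync(SqlCommand command)
    {
        try
        {
            return await command.ExecuteNonQueryAsync();
        }
        catch (SqlException ex) when (ex.Number == -2) // -2 = execution timeout
        {
            Console.WriteLine($"Timed out after {command.CommandTimeout}s: {command.CommandText}");
            throw;
        }
    }
}
```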

deadwards90 commented 1 year ago

@dazinator SQL Server does report on long-running queries, hence why we assumed the error was happening at the application level. I don't have the information to hand any more, having left the company, but given these were 30+ seconds of waiting, it would have shown up in some of the information tables SQL Server provides.

We spent a considerable amount of time looking into the causes, and it was happening in multiple services for different queries and in different environments, some of which had little to no data that would even cause a 30+ second slowdown.

I understand the need to be thorough but we had run through a lot of the steps you have outlined. I just don't want this ticket to get lost in the ether because people assume it's laziness on our part.

jdudleyie commented 1 year ago

We've had this same issue multiple times. We had a ticket open with Microsoft on the SQL side for weeks; in the end they could not find any long-running queries, although we had Application Insights reporting some queries taking over 2 minutes, which is also not possible because our .NET SQL command timeout was set to 30 seconds in most cases, and never more than 1 minute. So they ruled out any SQL issue.

We had a separate ticket open with the App Service team (we run on App Service with Linux Docker containers) - and it also completely stumped them - no issues with threadpool, networking etc. In the end I have had to scale up to many more instances than should be necessary, and the issue has not recurred but each instance is only able to handle about 6 requests per second which is super low, given the average API request response time is under 500ms. So no idea of the root cause of this.

When the issue occurred, on our 24 vCore provisioned Azure SQL database, only about 10% usage was reported, so we were nowhere near database resource limits.

krompaco commented 1 year ago

Azure Support just told us that the product group identified the issue and will fix it Q4 2022.

@DLS201 any chance you have more information on this? Maybe a link to a bug or something? We are investigating this issue currently and would like to know what MSFT knows...

Yes, more information would be nice...

We have found the exception count goes down in our app after upgrading to 5.1.1 (from 5.0.1 in our case).

dbeavon commented 1 year ago

In fairness to @DLS201, the support engineers at Microsoft absolutely HATE giving out reference identifiers for their bugs, and they especially don't want to share anything that will be posted on the Internet. There is no concept of a KB database anymore either, like in the good old days. Microsoft doesn't seem to have a formalized way of sharing known bugs with the public. That would be way too helpful.

I suspect that the only information available to DLS201 is whatever is being tracked in his own CSS case. A CSS case number is very customer-specific, and there is probably no way for anyone else to benefit from the investigation that was done. Unless he was very persistent, he won't get any bug # from the PG or anything useful like that. The bug # is probably not something the PG would share either. Probably the best general-purpose identifier for the bug would be the "IcM #" used for communication between CSS and the PG. I've had some luck getting those in the past and sharing them with others.

I don't know if it helps anyone, but my best guesses are: (1) a DNS overload issue because of a misconfigured cache; (2) a firewall issue on certain ephemeral ports that are infrequently used, i.e. certain ports above 64000; or (3) the access token is generated fine on the client side, but occasionally Azure SQL has problems recognizing it on its side, perhaps because it contacts the AAD identity service for the tenant but the number of requests exceeds the identity service's limits, so it rebuffs Azure SQL on the back end.

I don't typically have the issue when running my stuff on-premises - only in someone else's VMs under load (in a Synapse Spark cluster, for example).

satuday commented 1 year ago

Not sure if it is the same issue, but I'm also seeing this error with System.Data.SqlClient in .NET 6 running in Linux AKS. Could it be the same underlying issue?

System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at System.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__126_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Dapper.SqlMapper.QueryAsync[T](IDbConnection cnn, Type effectiveType, CommandDefinition command) in /_/Dapper/SqlMapper.Async.cs:line 418

eisenwinter commented 1 year ago

Since I've struggled with the same issue, I will share our findings as well. We had this thoroughly checked, and one of the most striking things was that it only happens on *nix containers (tested with a wide variety, from Debian-based to Alpine, etc.); it does not happen on Windows machines or Windows containers. Strangely, we never experienced the issue on our staging system, even with load tests. The crucial point is that the SQL Server instance there was the same version as the production one, though it was running on a Linux host instead of a Windows host (or maybe the load tests weren't heavy enough, but it's kind of strange). What really mitigated the issue with the *nix containers was upgrading the SQL instance. We tried a lot of other routes, like switching SqlClient versions etc., but they didn't really affect the issue. Our suspected root cause is some difference in the Linux / Windows networking stack, but for now we won't do any further investigation, as the SQL Server upgrade strangely solved the issue somehow. Hope this information helps.


Edit: to clarify, we had this issue on-prem, not with cloud services, and the production SQL Server was running on a Windows host.

dazinator commented 1 year ago

Just a note to say that, despite my comment above about genuine timeouts, we still experience this issue regularly and have no idea as to the reason / cause. The most frustrating part of the issue is not even knowing how to trace it, especially when you are using Azure SQL, as we do not control that side of the infrastructure. We also run our workloads on Linux Docker containers, and this being a problem on Linux seems to be a common theme. @eisenwinter were you running on Linux Docker containers? Or just directly on Linux VMs?

eisenwinter commented 1 year ago

@dazinator we only tested containers on different Linux host systems, to see whether it was possibly related to a certain kernel version, but it seems like it's not. As I said, we stopped testing after the issue disappeared once we migrated the SQL Server to a newer version; just wanted to share our experience in case it helps others somehow.

Henery309 commented 1 year ago

After struggling with this and a few other issues with Azure SQL, we decided to move to PostgreSQL. Luckily, we were in a position to move away from Azure SQL.

tobyreid commented 1 year ago

I had a similar problem recently, the root cause was the introduction of a TransactionScope -> https://github.com/dotnet/SqlClient/issues/647#issuecomment-1602980797

jbogard commented 1 year ago

We are also experiencing this issue with Linux App Services against Azure SQL. We only began seeing this after migrating our App Services from Windows to Linux.

TroyWitthoeft commented 11 months ago

Same boat as others. In our situation we are using an Azure Linux Function App (net6.0, System.Data.SqlClient@4.8.3, EF 6.0.1) connecting to an Azure SQL Server. The DB shows no symptoms, vCore levels are low, and there are no long-running queries. Everything was running fine for years... then, pop: intermittent Unknown error 258 throughout the day.

krompaco commented 11 months ago

Is anyone experiencing these errors with net7.0?

jbogard commented 11 months ago

Yes, we tried upgrading to .NET 7 and the latest SNI package release (5.1.1). It didn't fix it. The only thing that has made any difference was the suggestion to bump the minimum thread count. We bumped it to 20 and now very rarely see the issue (maybe once a day or so).

Nothing seems to eliminate it so far, but the thread count workaround does help.

TroyWitthoeft commented 11 months ago

To continue the conversation about this issue being related to Linux, and to hopefully narrow in on an event... our Azure Application Insights logs show that these random timeouts AND application performance issues all started after Azure East US maintenance window QK0S-TC8 did SOMETHING to the Azure Functions host on August 17th at 9:00 PM EST. Something in that host update caused this Unknown error 258 to start appearing.

At that point in time, our host went from Linux 5.10.177.1-1.cm1 to Linux 5.15.116.1-1.cm2, application performance tanked shortly after, and we now have the sudden appearance of these Unknown error 258 exceptions. Some metric digging shows that memory usage (P3V3 with 32 GB) tanked along with performance.

[screenshot: memory usage and performance metrics around the host update]

No code or DB changes on our part; suddenly the App Insights logs show kernel 5.15, odd timeout errors, and we presumably can't access memory. @eisenwinter - you said you tried a few different kernels, did you go back to 5.10? Anyone else seeing this issue on 5.10 or lower?

Update: We converted the function app over to Windows. The sql timeout errors are gone and application performance is restored! 🎉

krompaco commented 11 months ago

We lifted to .NET 7 but still see occurrences.