Open aaronhigh-loyal opened 9 months ago
@aaronhigh-loyal how often do you see the issue? Is it happening intermittently?
@JRahnama, we encounter this issue 100% of the time when one of the "pre-requisite reproduction steps" occurs. I suspect there may be other triggering events, but these are the two we experience most frequently. For reference, those events are:
- Scaling an elastic pool
- Automated Azure SQL maintenance
I would suggest contacting the Azure support team, as they can provide you with a quicker response.
@JRahnama We have an open ticket with them which has no resolution. I opened this ticket as I have a suspicion that this is some kind of invalidated cache in SqlClient causing a downstream failure in the enclaves when a DB restarts. If you think that's incorrect, feel free to close this ticket and I'll continue my dialog with them.
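The only mitigation we can think of short of a full process restart would be to flush SqlClient's connection pools when the condition is detected. A minimal sketch of that idea (the detection hook is hypothetical, and we have not confirmed that this actually clears whatever enclave/session state goes stale):

```csharp
using Microsoft.Data.SqlClient;

public static class EnclaveFailureMitigation
{
    // Hypothetical hook: call this when monitoring detects the "application never
    // completed" / error 33195 pattern. It is not part of any SqlClient API.
    public static void OnEnclaveFailureDetected()
    {
        // Drops all pooled connections owned by this process so the next
        // SqlConnection.Open() establishes a fresh physical connection
        // (and, presumably, a fresh enclave session).
        SqlConnection.ClearAllPools();
    }
}
```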
@aaronhigh-loyal I cannot determine whether this is a SqlClient issue without further investigation. Examining issues in the GitHub repository can take some time, especially for urgent cases, but since the issue is consistently reproducible I will assume a repro is available. We will investigate further and get back to you if that assumption proves correct.
Can you provide a stack trace with the complete error message, please?
@JRahnama There is no stack trace beyond the Kestrel error noted above; no exception is surfaced. SqlClient fails silently, and the only other supporting evidence comes from the Azure SQL activity logs.
Exception message (Application Side):
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L7U' application never completed.
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L7V' application never completed.
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L87' application never completed.
Exception message (Database Side):
Internal enclave error. Enclave was provided with an invalid session handle. For more information, contact Customer Support Services. The service has encountered an error processing your request. Please try again. Error code 33195.
Hi, has this bug been fixed? Any update?
Hello, is there any fix under development?
@aaronhigh-loyal, to allow a complete analysis of this issue, could you please provide a repro?
Would configuring retry logic to make your application more resilient not be applicable in this scenario? Is the application using the Kestrel web server and EF Core? What is the scope of the DbContext: Singleton, Scoped, or Transient?
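In case it helps, a minimal sketch of SqlClient's built-in configurable retry logic (Microsoft.Data.SqlClient 5.x; the connection string, retry counts, and transient error list are illustrative only, and whether error 33195 is actually retriable is unknown):

```csharp
using System;
using Microsoft.Data.SqlClient;

var connectionString = "<your connection string>"; // placeholder

// Illustrative retry settings; tune for your workload.
var options = new SqlRetryLogicOption
{
    NumberOfTries = 5,
    DeltaTime = TimeSpan.FromSeconds(1),
    MaxTimeInterval = TimeSpan.FromSeconds(20),
    // Common Azure SQL transient errors, plus 33195 purely as an experiment.
    TransientErrors = new[] { 4060, 40197, 40501, 40613, 33195 }
};

var provider = SqlConfigurableRetryFactory.CreateExponentialRetryProvider(options);

using var connection = new SqlConnection(connectionString);
connection.RetryLogicProvider = provider;   // retries SqlConnection.Open()
connection.Open();

using var command = connection.CreateCommand();
command.RetryLogicProvider = provider;      // retries Execute* calls
command.CommandText = "SELECT 1;";
command.ExecuteScalar();
```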
We are facing the same issue. Our pods are deployed in AKS, and in the pod logs we see the Kestrel error "...application never completed"; checking the Azure SQL logs, we also see "Internal enclave error...". On the application side, there are no errors or exceptions that could be used to implement retries.
The problem seems to resolve itself after several hours or by restarting the pods. We hit this after Azure automated maintenance events, as described in this issue.
The application is using the Kestrel web server and EF Core. The scope of the DbContext is Scoped. The package versions in use are:
- Microsoft.EntityFrameworkCore 7.0.5
- Microsoft.Data.SqlClient 5.1.5
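For reference, the registration is roughly the following (simplified sketch; AppDbContext and the connection string name are placeholders). EF Core's SQL Server execution strategy is shown for completeness, although with no exception surfaced there is nothing for it to retry:

```csharp
using Microsoft.EntityFrameworkCore;

var builder = WebApplication.CreateBuilder(args);

// AddDbContext registers the context with a Scoped lifetime by default.
builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseSqlServer(
        builder.Configuration.GetConnectionString("Default"),
        sql => sql.EnableRetryOnFailure(
            maxRetryCount: 5,
            maxRetryDelay: TimeSpan.FromSeconds(10),
            errorNumbersToAdd: null)));

var app = builder.Build();
app.Run();

// Placeholder context type.
public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }
}
```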
Describe the bug
Connectivity is permanently lost when accessing an Azure SQL database on DC-series hardware with SGX enclaves via SqlClient after an "event" on the Azure SQL database. This results in buried exceptions in SqlClient and enclave errors on the database side. No exceptions are surfaced in the calling application, and the only application-side manifestation (beyond the failure itself) is the Kestrel errors quoted earlier.
To restore connectivity, all impacted applications using SqlClient must be restarted.
To reproduce
This example assumes the database being connected to is an Azure SQL DB on DC-series hardware with SGX secure enclaves. Everything works under nominal conditions and consistently fails when one of the pre-requisite repro steps below is taken.
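For reference, a connection string along these lines is assumed; all values are placeholders, with Column Encryption Setting, Attestation Protocol, and Enclave Attestation Url being the enclave-relevant keywords:

```csharp
using Microsoft.Data.SqlClient;

// Placeholder server, database, authentication, and attestation values.
const string connectionString =
    "Server=tcp:myserver.database.windows.net,1433;" +
    "Database=mydb;" +
    "Authentication=Active Directory Default;" +
    "Encrypt=True;" +
    "Column Encryption Setting=Enabled;" +
    "Attestation Protocol=AAS;" +
    "Enclave Attestation Url=https://myattestation.attest.azure.net/attest/SgxEnclave;";

using var connection = new SqlConnection(connectionString);
connection.Open();
```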
Pre-requisite reproduction steps (one of the following actions must be taken; there may be other triggering events, but these have been observed to cause it to date):
- Scaling an elastic pool
- Automated Azure SQL maintenance
Expected behavior
SqlClient should either recover connectivity after the database event or surface an exception to the calling application, without requiring every impacted application to be restarted.
Further technical details
Microsoft.Data.SqlClient version: 5.1.x
.NET target: .NET 6 and .NET 8
SQL Server version: Azure SQL Database, DC-series hardware, SGX secure enclaves, elastic pool
Operating system: Docker containers hosted in AKS
Additional context
There are likely other triggering events from the Azure SQL side that reproduce this issue, but the two we've noted thus far are:
- Scaling an elastic pool
- Automated Azure SQL maintenance
When either occurs, all applications accessing these DBs with SqlClient must be restarted. All applications are hosted in AKS clusters.
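As a stopgap until the root cause is understood, one option would be to let AKS perform the restart automatically: a liveness probe backed by a health check that opens a fresh, non-pooled connection and runs a query intended to exercise the enclave path. This is only a sketch, under the assumption that such a probe connection would fail detectably; the table name and query are placeholders:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class EnclaveHealthCheck : IHealthCheck
{
    private readonly string _connectionString;

    public EnclaveHealthCheck(string connectionString) => _connectionString = connectionString;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Disable pooling so the probe exercises a brand-new physical connection.
            var csb = new SqlConnectionStringBuilder(_connectionString) { Pooling = false };

            await using var connection = new SqlConnection(csb.ConnectionString);
            await connection.OpenAsync(cancellationToken);

            await using var command = connection.CreateCommand();
            // Placeholder query: replace with one that actually exercises the enclave
            // path (e.g. a rich computation over an enclave-enabled encrypted column).
            command.CommandText = "SELECT TOP (1) 1 FROM dbo.SomeEnclaveEncryptedTable;";
            await command.ExecuteScalarAsync(cancellationToken);

            return HealthCheckResult.Healthy();
        }
        catch (Exception ex)
        {
            // Unhealthy => the AKS liveness probe fails and the pod is restarted.
            return HealthCheckResult.Unhealthy("Enclave-backed query failed.", ex);
        }
    }
}
```

The check would be registered with builder.Services.AddHealthChecks().AddCheck("enclave", new EnclaveHealthCheck(connectionString)), exposed via app.MapHealthChecks("/healthz"), and the pod's livenessProbe pointed at that path.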