SqlClient fails to connect to SGX Secure Enclave backed databases after database events

aaronhigh-loyal commented 9 months ago

Describe the bug

Connectivity is permanently lost when accessing an Azure SQL database on DC-series hardware with SGX enclaves through SqlClient after an "event" on the Azure SQL database. This results in exceptions buried inside SqlClient and enclave errors from the database. No exceptions are surfaced in the calling application, and the only application-side manifestation (beyond the failure itself) is the Kestrel errors below.

To restore connectivity, all impacted applications using SqlClient must be restarted.

Exception message (Application Side):
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L7U' application never completed.
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L7V' application never completed.
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L87' application never completed.
Stack trace (Application Side):
same as above, no exceptions.

Exception message (Database Side):
Internal enclave error. Enclave was provided with an invalid session handle. For more information, contact Customer Support Services.
The service has encountered an error processing your request. Please try again. Error code 33195.
Stack trace (Database Side):
n/a

To reproduce

This example assumes that the database being connected to is an Azure SQL DB on DC-series hardware with SGX Secure Enclaves. It works under nominal conditions and consistently fails once one of the pre-requisite repro steps below has been taken.

Pre-requisite reproduction steps (one of the following actions must be taken; there may be other triggering events, but these are the ones observed to cause it to date):

Scale an elastic pool containing the database from x vCores to y vCores
Have an Azure automated maintenance event occur. This should incur minimal downtime as described, but SqlClient never recovers from this.

await using var conn = new SqlConnection("Connection string to AzureSQL with Enclave Attestation");
// The connection must be opened before a command can be executed on it.
await conn.OpenAsync();
await using var command = new SqlCommand("select top 10 * from MyTable", conn);
await using var reader = await command.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    // logic
}

Expected behavior

SqlClient successfully makes requests again without requiring a full restart of the owning application.
SqlClient surfaces any exceptions to the calling application.

Further technical details

Microsoft.Data.SqlClient version: 5.1.x
.NET target: .NET 6, 8
SQL Server version: Azure SQL Database, DC-series hardware, SGX secure enclaves, elastic pool.
Operating system: (e.g. Windows 2019, Ubuntu 18.04, macOS 10.13, Docker container)

Additional context

There are likely other triggering events from the Azure SQL side to reproduce this issue, but the two we've noted thus far are:

Scaling an elastic pool
Automated Azure SQL maintenance

When either occurs, all applications accessing these DBs with SqlClient must be restarted. All applications are hosted in AKS clusters.

JRahnama commented 9 months ago

@aaronhigh-loyal how often do you see the issue? Is it happening intermittently?

aaronhigh-loyal commented 9 months ago

@JRahnama, we encounter this issue 100% of the time when one of the "pre-requisite reproduction steps" occurs. I suspect that there may be other "triggering events", but these are the two we experience most frequently. For reference, those events are:

Scaling an elastic pool
Automated Azure SQL maintenance

JRahnama commented 9 months ago

I would suggest contacting the Azure support team, as they can provide you with a quicker response.

aaronhigh-loyal commented 9 months ago

@JRahnama We have an open ticket with them, which has no resolution yet. I opened this issue because I suspect that some kind of cache in SqlClient is not being invalidated when a DB restarts, causing a downstream failure in the enclaves. If you think that's incorrect, feel free to close this issue and I'll continue my dialog with them.

JRahnama commented 9 months ago

@aaronhigh-loyal I cannot determine whether this is a SqlClient issue without further investigation. Examining issues in the GitHub repository may take some time, which matters especially for urgent cases. Since the issue is consistently reproducible, I will assume that we have a repro available. We will investigate further and get back to you if that assumption proves to be correct.

JRahnama commented 9 months ago

Can you provide the stack trace with the complete error message, please?

aaronhigh-loyal commented 9 months ago

@JRahnama There is no stack trace beyond the Kestrel error noted above; no exception is surfaced. SqlClient fails silently, and the only other supporting evidence comes from the AzureSQL activity logs.

Exception message (Application Side):
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L7U' application never completed.
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L7V' application never completed.
[05:34:47 FTL] Microsoft.AspNetCore.Server.Kestrel Connection id '0HMVNE3J76L87' application never completed.

Exception message (Database Side):
Internal enclave error. Enclave was provided with an invalid session handle. For more information, contact Customer Support Services.
The service has encountered an error processing your request. Please try again. Error code 33195.

pietrobr commented 2 months ago

Hi, has this bug been fixed? Any update?

fbellonireply commented 1 month ago

Hello, is there any fix under development?

arellegue commented 1 month ago

@aaronhigh-loyal, in order to have a complete analysis of this issue, could you kindly provide a repro, please?

Would configuring retry logic to make your application more flexible not be applicable in this scenario? Is the application using the Kestrel web server and EFCore? What is the scope of the DbContext: Singleton, Scoped, or Transient?

MForghieri commented 1 month ago

We are facing the same issue. Our pods are deployed in AKS, and in the pod logs we see the Kestrel error "...application never completed"; checking the Azure SQL logs, we can also see "Internal enclave error...". On the application side, there are no errors or exceptions that can be used to implement retries.
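Because the call simply never completes, one possible way to obtain an exception to act on is to bound the wait on the application side. The following is only a minimal sketch, not a fix for the underlying problem: `HangGuard` and `RunWithTimeoutAsync` are hypothetical names and not part of SqlClient, and the sketch assumes .NET 6 or later, where `Task.WaitAsync(TimeSpan)` is available. It has not been verified against the enclave failure described in this issue.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HangGuard
{
    // Hypothetical helper (not part of SqlClient): runs an async operation and
    // throws TimeoutException if it has not completed within `timeout`.
    public static async Task<T> RunWithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> operation, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource();
        Task<T> work = operation(cts.Token);
        try
        {
            // WaitAsync stops waiting after the timeout even if the operation
            // itself never observes the cancellation token.
            return await work.WaitAsync(timeout);
        }
        catch (TimeoutException)
        {
            // Ask the abandoned operation to stop, then surface the timeout.
            cts.Cancel();
            throw;
        }
    }
}

// Usage sketch, assuming `command` is a SqlCommand on an open connection:
//
//   await using var reader = await HangGuard.RunWithTimeoutAsync(
//       ct => command.ExecuteReaderAsync(ct), TimeSpan.FromSeconds(30));
```

A `TimeoutException` raised this way gives the application something to log or react to. Whether a retry on a fresh connection would then succeed is an open question, since restarting the application has so far been the only reliable recovery.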

The problem seems to resolve itself after several hours or by restarting the pods. We are facing this issue after Azure automated maintenance events, as described in this issue.

The application is using the Kestrel web server and EFCore. The scope of the DbContext is Scoped. Below I list the versions of the packages used.
Microsoft.EntityFrameworkCore 7.0.5
Microsoft.Data.SqlClient 5.1.5
