dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
837 stars 278 forks source link

When using SqlColumnEncryptionAzureKeyVaultProvider with an expired client secret, bizarre error messages and symptoms result #2683

Closed jehhynes closed 2 days ago

jehhynes commented 1 month ago

When using SqlColumnEncryptionAzureKeyVaultProvider with an expired client secret, the following error is returned:

SqlException: New request is not allowed to start because it should come with valid transaction descriptor.

This error is very confusing and has nothing to do with the actual problem. After seeing this error, we then see the following error on all queries even when column encryption is not enabled on the command:

IndexOutOfRangeException: SOME_COLUMN_NAME
    at Microsoft.Data.SqlClient.SqlDataReader.GetOrdinal(string name)

(Note that the query actually does select SOME_COLUMN_NAME)

Full stack trace

SqlException: New request is not allowed to start because it should come with valid transaction descriptor.
Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, bool breakConnection, Action<Action> wrapCloseInAction)
Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, bool breakConnection, Action<Action> wrapCloseInAction)
Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, bool callerHasConnectionLock, bool asyncClose)
Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, out bool dataReady)
Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, string resetOptionsString, bool isInternal, bool forDescribeParameterEncryption, bool shouldCacheForAlwaysEncrypted)
Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, bool returnStream, bool isAsync, int timeout, out Task task, bool asyncWrite, bool inRetry, SqlDataReader ds, bool describeParameterEncryptionRequest)
Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, bool returnStream, TaskCompletionSource<object> completion, int timeout, out Task task, out bool usedCache, bool asyncWrite, bool inRetry, string method)
Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
NHibernate.AdoNet.AbstractBatcher.DoExecuteReader(DbCommand cmd)
NHibernate.Loader.Loader.GetResultSet(DbCommand st, QueryParameters queryParameters, ISessionImplementor session, IResultTransformer forcedResultTransformer)
NHibernate.Loader.Loader.DoQuery(ISessionImplementor session, QueryParameters queryParameters, bool returnProxies, IResultTransformer forcedResultTransformer, QueryCacheResultBuilder queryCacheResultBuilder)
NHibernate.Loader.Loader.DoQueryAndInitializeNonLazyCollections(ISessionImplementor session, QueryParameters queryParameters, bool returnProxies, IResultTransformer forcedResultTransformer, QueryCacheResultBuilder queryCacheResultBuilder)
NHibernate.Loader.Loader.DoList(ISessionImplementor session, QueryParameters queryParameters, IResultTransformer forcedResultTransformer, QueryCacheResultBuilder queryCacheResultBuilder)

Expected behavior

I expect a clear error message to be returned which states that column encryption was not possible due to the expired client secret I do not expect it to "pollute" the connection pool and make all requests to the same database fail with bizarre error messages.

Further technical details

Microsoft.Data.SqlClient version: 5.1.4 .NET target: .NET 8.0 SQL Server version: 2022 Operating system: Windows 11

Additional context

<PackageReference Include="Microsoft.Data.SqlClient.AlwaysEncrypted.AzureKeyVaultProvider" Version="3.0.0" /> NHibernate 5.5.2

ErikEJ commented 1 month ago

Have you tried with 5.2.1 ?

jehhynes commented 1 month ago

SqlClient 5.2.1 does the same thing.

mdaigle commented 3 weeks ago

Hi @jehhynes thanks for filing this issue. I'm working on reproducing it. In the meantime if you already have a minimal repro you can share that would be much appreciated.

jehhynes commented 3 weeks ago

Hi @mdaigle I could create a minimal repro but it's dependent on a specific state in an external system (i.e. Expired Azure client secret) which I would be unable to share.

mdaigle commented 3 weeks ago

Understood. Even with that limitation it will be helpful to see.

jehhynes commented 3 weeks ago

I haven't been able to reproduce this with straight SqlClient usage. I'm wondering if the issue has something to do with the way that NHibernate is interacting with SqlClient.

arellegue commented 3 weeks ago

@jehhynes Could you share the repro, please?

jehhynes commented 3 weeks ago

This repro's consistently in my application, but I have not been able to figure out how to reproduce an identical issue outside of my application. I have however reproduced a similar issue. It seems to be caused by a more complex combination of factors than I originally thought.

The error is occurring in an ASP.NET Core app, and I think it may have to do with multithreading. Upon further investigation, this error is initially received when rendering a view (using Html.PartialAsync):

Failed to decrypt column 'ssn59_479_'. Failed to decrypt a column encryption key using key store provider: 'AZURE_KEY_VAULT'. Verify the properties of the column encryption key and its column master key in your database. The last 10 bytes of the encrypted column encryption key are: 'C1-BA-F1-E9-5C-97-6D-39-D7-F6'. ClientSecretCredential authentication failed: A configuration issue is preventing authentication - check the error message from the server for details. You can modify the configuration in the application registration portal. See https://aka.ms/msal-net-invalid-client for details. Original exception: AADSTS7000215: Invalid client secret provided. Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app '6859ea32-a175-4d32-9511-6f0d40637e1a'. Trace ID: 9989e0a9-5487-4530-9934-286a2f597c00 Correlation ID: ca4e081b-04d1-47bc-b5d3-4514f6e2a3fa Timestamp: 2024-08-20 19:33:12Z

Then another database query occurs from an IResultFilter attribute (in OnResultExecuted), which triggers this error:

New request is not allowed to start because it should come with valid transaction descriptor.

It seems the first error (which is the expected error) is hidden (presumably due to AspNetCore's multithreaded behavior) and instead the second error is show (which is not expected).

Based on these theories, I attempted to create a minimal repro using AspNetCore (attached). Although I wasn't able to get the exact same exception, I did encounter some strange/unexpected behavior/errors so I think it's starting to touch on this issue and may provide you what you need at least to begin troubleshooting.

The repro project

AlwaysEncryptedRepro.zip

Thanks!

jehhynes commented 3 weeks ago

Update:

I discovered that by adding a second attribute implementing IResultFilter.OnResultExecuted, we uncover an additional sequence of bizarre exceptions that you can observe by refreshing the page repeatedly:

image

image

image

image

The updated repro project: AlwaysEncryptedRepro.zip

jehhynes commented 3 weeks ago

Update 2

So I further discovered that by making the ISessionFactory static (instead of rebuilding it with every request) I get a different sequence of errors:

image

image

image

This sequence is similar to what I observed in my application - see my StackOverflow question here: https://stackoverflow.com/questions/78759572/sqlexception-new-request-is-not-allowed-to-start-because-it-should-come-with-va

The updated repro project:

AlwaysEncryptedRepro.zip

mdaigle commented 3 weeks ago

Thanks for the repros @jehhynes. I've been working through them and have a couple avenues for investigation.

Just to double check, are you running each of these repros with pooling enabled? And same question for the original issue in your application, is pooling enabled there?

If you can, can you tell me if the issue still repros when you disable pooling?

Thanks!

jehhynes commented 3 weeks ago

If I disable pooling, the bizarre errors go away.

mdaigle commented 2 weeks ago

This is a bug that is leaving connections in an inconsistent state with unread bytes still hanging on the wire. When pooling is enabled, the inconsistent connections are recycled back through to your application and result in undefined error behavior for new connections with the same connection string. The unread bytes off the wire from the last command interfere with the results for the new command.

A fix was merged to main last month and it looks like it's slated for the 6.0 release: #2618

David-Engel commented 2 weeks ago

Fixes for this have been ported to the 5.1.6 (#2818) and 5.2.2 (#2817) hotfixes that are coming out soon.

dauinsight commented 2 weeks ago

Hotfixes 5.1.6 and 5.22 have been released!

jehhynes commented 2 weeks ago

@mdaigle @David-Engel I installed 5.2.2 on the repro project and re-tested. It now gives the following exception after waiting 30 seconds and the command timing out:

SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

I don't think this is a correct fix - the error message I expect is the Failed to decrypt column message

mdaigle commented 2 weeks ago

What I'm seeing is that DrainData() is double counting the encrypted column.

Here's the sequence of events:

  1. column header read
  2. sharedState._nextColumnHeaderToRead++ (incremented from 0 to 1)
  3. column data read
  4. column data decryption (fails)
  5. data drained

So at the time we drain the data, we've already read all of the bytes for the encrypted column off of the wire or out of the buffer, but we've not incremented the _nextColumnDataToRead value from 0 to 1. This value is only incremented after we exit the scope of TdsParser.TryReadSqlValue(). DrainData() compares _nextColumnDataToRead to _nextColumnHeaderToRead to determine if we stopped part way through reading a column. If _nextColumnDataToRead < _nextColumnHeaderToRead, it assumes that the column data still needs to be read off the wire/out of the buffer. But we did already read out the column data. So when DrainData() tries to flush enough data for the length of the column, it will fail. Either this was the last column and there's no data available or we get into an inconsistent state where we read some of the next column but treat it as if it's this column.

If there's not enough data available, we wait on the socket for more data which will never arrive. This leads to a timeout, which is thrown in place of the column decryption error.

mdaigle commented 2 weeks ago

I have a draft PR here (#2825) that, in local testing, fixes the issue. Using this change with the repro produces the expected error. I'm still looking into the best way to add tests for this to prevent regressions in the future.

kf-gonzalez2 commented 2 days ago

Closing issue as PR has been merged