dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
842 stars 280 forks source link

A transport-level error has occurred when receiving results from the server if query length exceeds 956 characters #2141

Open guyvaio opened 1 year ago

guyvaio commented 1 year ago

In MAUI Android using Microsoft.Data.SqlClient, it's not possible to run a query exceeding 956 characters.

Exception message: Stack trace: Microsoft.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - Success) at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction) in D:\a\_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlConnection.cs:line 2010 at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\src\Microsoft\Data\SqlClient\SqlInternalConnection.cs:line 770 at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParser.cs:line 1404 at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 734 at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1783 at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1256 at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 1181 at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() in D:\a_work\1\s\src\Microsoft.Data.SqlClient\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 904 at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs:line 446 at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParser.cs:line 2019 at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData() in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlDataReader.cs:line 1142 at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData() in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlDataReader.cs:line 258 at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 5157 at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 4951 at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 4621 at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 4491 at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 2050 at Microsoft.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior) in D:\a_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 2000 at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior) at System.Data.Common.DbDataAdapter.FillInternal(DataSet dataset, DataTable[] datatables, Int32 startRecord, Int32 maxRecords, String srcTable, IDbCommand command, CommandBehavior behavior) at System.Data.Common.DbDataAdapter.Fill(DataSet dataSet, Int32 startRecord, Int32 maxRecords, String srcTable, IDbCommand command, CommandBehavior behavior) at System.Data.Common.DbDataAdapter.Fill(DataSet dataSet) at MercatorApi.Api.Zselect(SqlCommand cmd, Boolean withSchema, MercatorSqlParam[] mercatorSqlParams) in M:\Mercator_dotnet\MercatorTunnel\MercatorTunnel\Api\MercatorApiZselectSqlClientCore.cs:line 133 ClientConnectionId:7f5e8352-c020-4aaa-9a88-42ca61cfaa28

To reproduce

sqlCommand.CommandText = new string(' ', 948) + "select 1"; works fine.

sqlCommand.CommandText = new string(' ', 949) + "select 1"; gives the error.



### Expected behavior
Queries exceeding 956 characters should work.

### Further technical details
Microsoft.Data.SqlClient version: 5.1.1,
.NET target: MAUI net7.0-android
SQL Server version: SQL Server 2019
Operating system: Windows 11
guyvaio commented 1 year ago

Same error in Microsoft.Data.SqlClient version 5.2.0-preview3.23201.1

JRahnama commented 1 year ago

@guyvaio thanks for reporting this. We will check this and will get back to you.

JRahnama commented 1 year ago

@guyvaio just to confirm that you are not using any stored procedures in the command? It is just passing actual text as command?

guyvaio commented 1 year ago

No, reproduced with just n spaces followed by select 1. Latest tests on other db and SQL servers showed us 956 is not constant and should be sometimes increased.

jFrancoisH commented 1 year ago

I reproduced the bug. Here is the repository with the error https://github.com/jFrancoisH/TestSqlClientAndroid/tree/main

guyvaio commented 1 year ago

Thanks for the repo. The relevant code is in MainPage.xaml.cs. Exception occurs on line 34.

guyvaio commented 1 year ago

The same error occurs everywhere the query sent to SQL Server exceeds a certain size:

David-Engel commented 1 year ago

I tried a similar repro on Windows and all query sizes execute fine. So this appears to be limited to Android and/or .NET MAUI. We'll need to debug in that environment when someone gets time to set one up. There could just be in issue in one of the underlying libraries.

guyvaio commented 1 year ago

Yes, initial issues are observed in Android. After reading the MDS source code with my dev team (many time lost !) we deduce this error occurs in all « unix » based app. The win-nt (!) compilation still uses native code. The error occurs in the new TDS parser that, obviously, is not finished nor stable.

guyvaio commented 1 year ago

The place where it goes wrong is in src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SNI\SNIPacket.cs, line 258 _dataLength = stream.Read(_data, _headerLength, _dataCapacity); The stream is of type Microsoft.Data.SqlClient.SNI.SNISslStream.

If sqlCommand.Text has an accepted length, _dataLength is non zero. If sqlCommand.Text exceeds 956 (or more) characters, _dataLength is zero. From there, we enter a path of error.

I guess the problem is in the Read method of underlying SslStream.

Hope this can help.

guyvaio commented 1 year ago

Further investigations have shown that _dataLength is zero (see above comment) because SQL Server closes the connexion. When the query exceeds 956 (or more) characters, the request sent to SQL Server is malformed, SQL Server detects it and closes the connexion.

The error occurs in external code (seen in the stack) between line 318 of SNIPacket.cs

public void WriteToStream(Stream stream)
{
    stream.Write(_data, _headerLength, _dataLength);

and line 183 of SslOverTdsStream.NetCoreApp.cs

if (!_encapsulate)
{
    _stream.Write(buffer);
    _stream.Flush();
    return;
}

We guess that the SSL encryption occurs between these 2 lines and that the encryption produces a wrong message.

If in the connection string, Encrypt = false is forced and SQL Server does not require encryption, the error does not occur. The stream of the SNIPacket is of type SNINetworkStream (not a SNISslStream), so encryption does not occur anymore and everything goes wel.

Hope this can help.

Credits to @jFrancoisH for joined efforts to understand this mess.

guyvaio commented 1 year ago

I do confirm there is no problem in iOS.

David-Engel commented 1 year ago

@guyvaio Could there be a device between the client and server that is inspecting or manipulating the network packets? Maybe a security device trying to identify or sanitize potentially malicious SQL? I've seen it before. They can be very difficult to troubleshoot. Or maybe some security software? I could imagine that if something was modifying the encrypted data, the server would reject it and close the connection.

guyvaio commented 1 year ago

@David-Engel Thanks for your comment.

All tests have been done in a private LAN. The same code works in the same infrasctructure with

It seems to me that the conclusion we must draw is that there is a bug in the Android platform specific code that performs encryption.

last-Programmer commented 10 months ago

I am also facing this issue when the SqlConnection is used by SMO in android and then i try to script a table i get this error.

image

The same code works fine with iOS.

Workaround

**setting

builder.Encrypt = SqlConnectionEncryptOption.Optional;

Solves the problem for me in android. But i am not doing this for iOS.**

awakecoding commented 10 months ago

We have recently upgraded to .NET 8 for Remote Desktop Manager (RDM) on Android after TLS-related issues were finally fixed, only to start getting error reports from different customers for Microsoft.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - Success).

image

I took the time to set up RDM Android with the latest Visual Studio 2022 Preview to launch it in WSA, and used a test VM with SQL Server as my RDM SQL Server data source with TLS encryption enforced server-side. I then used instructions for TLS pre-master secret dumping from LSASS to decrypt traffic on port 1433 from the server in Wireshark.

RDM Windows can connect to the same SQL Server data source without problem - only RDM Android is affected, and it's very easy to reproduce, as we do a bunch of queries after the initial connection that go over the weird limit observed. At first sight, there doesn't seem to be an obvious issue with the query size, or even the response size:

image image

However, looking at the last query sent from the client to the server, we can observe an important difference: the query was fragmented in two TLS records (2048+307 bytes). This isn't TCP fragmentation of a single large TLS record like the previous query, and the fact that it has been split in two TLS records appears to have triggered a server-initiated disconnection upon reception:

image image

Normally, an application should be able to handle this nicely, but there's potentially something unhandled there. We do know that the issues only occur with TLS, and that TLS issues were only recently fixed. In my past experience with FreeRDP, I encountered similar weirdness between OpenSSL and SChannel - for instance, zero-length TLS records had to be disabled. Here we seem to be hitting a problem related to the way a single query message is fragmented over multiple TLS records. What about platforms where the issue doesn't happen, like RDM Windows? The equivalent query is sent as a single TLS record, and the server doesn't drop the connection!

image

Applications normally handle such fragmentation differences correctly, but I really wouldn't be surprised to learn that SQL Server + SChannel doesn't handle this properly. I've had issues in the past with FreeRDP for CredSSP+OpenSSL interoperability where we had to disable zero-length TLS records because SChannel couldn't handle them. SQL Server probably has similar weird unhandled cases. I have no idea what triggers the fragmentation at the TLS record level in .NET 8 for Android, but the fragment size appears to be set at 2048, which is close-enough to a query of 956 characters encoded in UTF-16. The numbers fit, now we just need someone more knowledgeable of the stack to take a closer look at why it's behaving the way it is right now, and see how queries could be sent as a single TLS record fragmented at the TCP level instead of sending it as multiple TLS records, and I think it should resolve the issue.

JRahnama commented 10 months ago

@awakecoding if I got it correctly, Server Hello is requesting Certificate from the client. What are pre-login messages sent to the server?

JRahnama commented 10 months ago

Also, what connection string properties are being used?

awakecoding commented 10 months ago

@JRahnama I can send decrypted Wireshark captures showing the actual traffic without TLS record fragmentation that works from Windows, and TLS record fragmentation causing SQL Server to disconnect the Android client, GitHub just doesn't allow .pcapng file attachments. There is no client certificate authentication involved, and the connection string doesn't matter, what matters is the issue always happens with TLS and query strings around 956 characters because they produce messages over 2048 bytes that somehow only get sent as multiple TLS records on Android, and no other platform. We suspect this would be an issue within the .NET runtime combined with SQL Server being unable to handle query messages split in multiple TLS records.

I've been digging all morning trying to get SQL Server to produce traces to get a hint as to why it disconnects after receiving the query in split TLS records, but I couldn't get it to be verbose enough for that. This is a critical issue for us and we're looking for a swift response - and pretty much everyone else in this thread has been able to reproduce it easily: .NET 8, SQL Server with TLS, and a long query string. That's it, boom. We were able to hit the issue with a smaller sample like the one that was shared by @jFrancoisH, and our analysis now points in the same direction as @guyvaio - https://github.com/dotnet/SqlClient/issues/2141#issuecomment-1731215123

Knowing that at the protocol level, the fragmentation happens at 2048 bytes, the original magical value of 956 characters becomes less relevant. All you really need is a query resulting in a protocol message going over 2048 bytes, for which the end result is a single query message sent over multiple TLS records rather than a single TLS record fragmented at the TCP level. I could not observe the same behavior as Android from any other platform, and only Android is affected. I tried finding a hint in the SqlClient source code, and I don't see an obvious link with the fragmentation.

However, there are many hints that point towards the .NET runtime implementation of SslStream: as @guyvaio reported earlier, it happens in external code (the .NET runtime here), in between lines from SqlClient. This makes a lot of sense given the poor state of SslStream for Android prior to .NET 8, it only just got fixed. This is not battle-tested code, there's been churn in that area very recently, so bugs like this would not be a surprise.

In the .NET runtime, we can see the value 2048 used as the InitialBufferSize for SslStream. When _stream.Write() is called from SqlClient, it calls SSLStreamWrite from pal_sslstream.c. The managed wrapper of the native Android SSLStreamWrite is in Interop.Ssl.cs.

I've tried to look and compare the various platform-specific implementations to find a hint on what might cause the difference in behavior, but I could not find it yet. However, Wireshark doesn't lie: those long queries are only split in multiple TLS records with .NET on Android, and all other platforms work. I'm still looking for undeniable proof that this is the reason SQL Server drops the connection, but from observation, there's a direct cause and effect relationship: if you stick to queries that result in messages smaller than 2048 bytes over the wire, the queries get sent as a single TLS record, and everything works fine. Other platforms send queries over 2048 as a single TLS record, and they work fine.

Would it be possible to increase the priority on this issue and maybe get people from the .NET runtime involved? If you know people from the SQL Server team, maybe this could help confirm the theory about queries sent over multiple TLS records not being handled properly?

JRahnama commented 10 months ago

@awakecoding we can look into this, but there is a release coming up on December 7th and have limited time up until that time.

@wfurt could this be something related to your team?

awakecoding commented 10 months ago

@JRahnama @wfurt here are the Wireshark captures (the link is only good for 24 hours), they have the TLS pre-master secrets embedded into the .pcapng, so Wireshark will decrypt TLS traffic. It's all test data so don't worry, but we can see that from Windows, we don't get queries split in multiple TLS records, while from Android, the last message before SQL Server disconnects the client is a query split in two TLS records. One thing I noticed is that SQL Server simply closes the TCP connection without even bothering to send a TLS alert, but it still closes it nonetheless: https://wormhole.app/5E9L3#EETArv5SLuAlwOy74lWqMw

awakecoding commented 10 months ago

I've got further proof of the direct relationship between long queries split multiple TLS records and the server-initiated TCP disconnection: I caught it in WinDBG while inspecting traffic in Wireshark. The moment I resumed execution to let sqllang!Tcp::FCloseTransport complete, I saw the TCP FIN packet sent by the server. This is definitely the call stack in SQL Server.

image

During my testing, I sometimes saw the split TLS records sent in multiple packets - sometimes they're batched together, but the end result is still the same. I can put a breakpoint on either sqllang!Tcp::FCloseTransport or sqllang!Tcp::FProcessSessionKill to break earlier on the call stack leading to the disconnection. There's some check done somewhere in sqllang!process_messages that results in an error condition, but with WinDBG disassembly I couldn't figure out much, and in IDA, it's a bit hard to follow:

image

wfurt commented 10 months ago

cc @simonrozsival for the Android angle. (it may be worth of trying .NET 8 as there were many Android improvements) From my prospective splitting TLS to multiple segments should be just fine ... assuming the data above are not mangled. If this creates problem it would look like server bug to me. If for example somebody would assume all data would arrive in single read then splitting to multiple segments may break it. But that assumption would be fragile and unreliable IMHO.

We could dump more data from schannel or SslStream on the server to see if there is any TLS processing error or if the close comes from the SQL itself.

awakecoding commented 10 months ago

While I agree that SQL Server should handle TDS RPC requests split into multiple TLS records or "messages", I think I just found the corresponding hint that it is not handled in MS-TDS:

Screenshot_20231125-114153.png

To execute an RPC on the server, the client sends an RPCRequest data stream to the server. This is a binary stream that contains the RPC Name (or ProcID), Options, and Parameters. Each RPC MUST be contained within a separate message and not mixed with other SQL statements.

Seeing how the protocol security is heavily tied to SSPI for non-TLS and then with SChannel SSPI on Windows for TLS, to me this is indicative that it is expecting requests to come in 1) individually and 2) as a complete message. It doesn't say it explicitly but it is most likely the byproduct of this condition.

Of course the final confirmation would need to come from the SQL Server team, but in the meantime, what we need is the quickest workaround to avoid the fragmentation of TDS RPC messages (type 3) into multiple TLS records. I don't know if the 2048 initial buffer size pointed to earlier is truly responsible for the fragmentation, we'd need a way to at least double it to work around the problem until a more permanent solution can be found.

awakecoding commented 10 months ago

Success! I finally managed to prove my point by increasing the value of InitialBufferSize from 2048 to 4096, rebuilt the .NET runtime from the v8.0.0 tag, then manually copied over my patched System.Net.Security.dll in the following locations, which is where it was picked up by Visual Studio according to ProcMon, and surprise, surprise: Remote Desktop Manager works again, no issue resulting from broken SQL queries, because the long TDS RPC requests no longer get fragmented. Of course, the issue would still happen for requests longer than 4096 bytes, but I feel like just doubling the buffer size is good enough to fix this blocking issue (some of our customers are unable to use RDM Android anything until we fix this).

I don't really know if there's a better way to swap .NET runtime assemblies for a custom build, but here are the paths on Windows where I replaced System.Net.Security.dll with my locally built patched copy:

Despite having a separate copy for reach CPU architecture, the DLLs are exactly the same in all of those folders.

Here's a zipped copy of my patched DLL for convenience: System.Net.Security.dll.zip

And here's a patch file, but it's just a matter of changing private const int InitialBufferSize = 2048; to something higher: workaround-for-dotnet-sql-client-issue-2141.patch

I hadn't realized that we could attach zip files to GitHub comments until now, so here are the decrypted Wireshark captures mentioned earlier for my initial analysis of the issue: rdm-android-mssql-failure.zip rdm-windows-mssql-success.zip

TL;DR: SQL Server requires TDS RPC requests to be sent as a single TLS record, and the protocol specification hints about this limitation. The initial buffer size of 2048 bytes used in the .NET 8 runtime SslStream stack results in fragmentation of messages into multiple TLS records, breaking SqlClient on Android when sending queries of about 956 characters (header size + UTF-16 string encoding makes it close to 2048 bytes). The only solution is to avoid splitting SQL queries (TDS RPC requests) in multiple TLS records, of face immediate disconnection from SQL Server.

Increasing the initial buffer size to 4096 is sufficient to work around the issue for Remote Desktop Manager on Android: image

Since we need it a fix now, I'll look into overwriting System.Net.Security.dll in our CI build runners. I would be fine with just increasing the initial buffer size, but I somehow doubt the .NET runtime time will accept this as a proper fix. We should look into the best way to work around the issue in the short term until a more permanent solution can be found.

last-Programmer commented 10 months ago

@awakecoding Thank you the time you have taken into analysing this and finding the temporary fix. How does this works iOS? Any ideas?.

guyvaio commented 10 months ago

@awakecoding Thank you the time you have taken into analysing this and finding the temporary fix. How does this works iOS? Any ideas?.

As I tested it (with SQL Server) on Sep 25 there was no problem with iOS.

thenextman commented 10 months ago

@awakecoding Thank you the time you have taken into analysing this and finding the temporary fix. How does this works iOS? Any ideas?.

@last-Programmer The issue is in the dotnet TLS stack on Android, not on the SQL side. So iOS is unaffected.

awakecoding commented 10 months ago

My colleague @thenextman may have found the reason why the issue only happens with Android, from the docs of the Android SSLEngine APIs used by the .NET runtime:

The SSLEngine produces/consumes complete SSL/TLS packets only, and does not store application data internally between calls to wrap()/unwrap(). Thus input and output ByteBuffers must be sized appropriately to hold the maximum record that can be produced. Calls to SSLSession.getPacketBufferSize() and SSLSession.getApplicationBufferSize() should be used to determine the appropriate buffer sizes. The size of the outbound application data buffer generally does not matter. If buffer conditions do not allow for the proper consumption/production of data, the application must determine (via SSLEngineResult) and correct the problem, and then try the call again.

If we look at the .NET runtime DoWrap() implementation using the Android SSLEngine APIs (simplified and commented in-line here):

static PAL_SSLStreamStatus DoWrap(JNIEnv* env, SSLStream* sslStream, int* handshakeStatus)
{
    IGNORE_RETURN((*env)->CallObjectMethod(env, sslStream->appOutBuffer, g_ByteBufferFlip));
    jobject result = (*env)->CallObjectMethod(
        env, sslStream->sslEngine, g_SSLEngineWrap, sslStream->appOutBuffer, sslStream->netOutBuffer);
    if (CheckJNIExceptions(env))
        return SSLStreamStatus_Error;

    IGNORE_RETURN((*env)->CallObjectMethod(env, sslStream->appOutBuffer, g_ByteBufferCompact));

    *handshakeStatus = GetEnumAsInt(env, (*env)->CallObjectMethod(env, result, g_SSLEngineResultGetHandshakeStatus));
    int status = GetEnumAsInt(env, (*env)->CallObjectMethod(env, result, g_SSLEngineResultGetStatus));
    (*env)->DeleteLocalRef(env, result);

    if (g_SSLEngineResultStatusLegacyOrder)
    {
        status = MapLegacySSLEngineResultStatus(status);
    }

    switch (status)
    {
        case STATUS__OK:
        {
            return Flush(env, sslStream);
        }
        case STATUS__CLOSED:
        {
            (void)Flush(env, sslStream);
            (*env)->CallVoidMethod(env, sslStream->sslEngine, g_SSLEngineCloseOutbound);
            return SSLStreamStatus_Closed;
        }
        case STATUS__BUFFER_OVERFLOW:
        {
            // FIXME: we increase the output buffer size, but fail to call g_SSLEngineWrap again!
            // The alternative would be increase the buffer size *prior* to calling g_SSLEngineWrap and avoid this
            int32_t newCapacity = (*env)->CallIntMethod(env, sslStream->sslSession, g_SSLSessionGetPacketBufferSize) +
                                  (*env)->CallIntMethod(env, sslStream->netOutBuffer, g_ByteBufferRemaining);
            sslStream->netOutBuffer = ExpandBuffer(env, sslStream->netOutBuffer, newCapacity);
            return SSLStreamStatus_OK;
        }
        default:
        {
            LOG_ERROR("Unknown SSLEngineResult status: %d", status);
            return SSLStreamStatus_Error;
        }
    }
}

TL;DR: the Android SSLEngineWrap API return STATUS__BUFFER_OVERFLOW to tell that it could not write all the data bytes to the output buffer. The end result is a TLS record which holds only part of the data we wanted to send. Here we just catch the condition to increase the buffer size for the next call to SSLEngineWrap, but not the current one, which is already too late for SQLClient. Ideally, we should increase the buffer size before calling SSLEngineWrap to avoid getting the overflow status code, and get a single, clean TLS record.

awakecoding commented 10 months ago

I created a detailed issue report in the .NET runtime for the underlying problem that affects SqlClient: https://github.com/dotnet/runtime/issues/95295

last-Programmer commented 10 months ago

@awakecoding i need to set it to 8192 to get it working for my android app. my application uses SMO and it requires 8192 to get it working. It may break some other area of my app.

awakecoding commented 10 months ago

@awakecoding i need to set it to 8192 to get it working for my android app. my application uses SMO and it requires 8192 to get it working. It may break some other area of my app.

I would recommend adding this information to the issue on the .NET runtime side just so they know what minimum size should be required in practice, which is higher in your case: https://github.com/dotnet/runtime/issues/95295

JRahnama commented 10 months ago

@awakecoding so we can confirm that this is not SqlClient related issue? If the answer is yes, can we close the issue here and follow up in runtime repo please?

awakecoding commented 10 months ago

@awakecoding so we can confirm that this is not SqlClient related issue? If the answer is yes, can we close the issue here and follow up in runtime repo please?

The root cause of the problem is inside the .NET runtime and I do not know of a way to fix it within SqlClient, but I would like to keep this issue open here until it is fixed. Closing it would just make it look like it has been handled when it is not the case yet. However, let's follow up on the other issue, and come back to close it once the .NET runtime has an upstream fix.

wfurt commented 10 months ago

@awakecoding so we can confirm that this is not SqlClient related issue?

I'm wondering if we should push issue to the server. While we may improve theSslStream on Android I feel the server's behavior is finicky at best.

JRahnama commented 10 months ago

@wfurt I’m not certain who I should contact from the SQL Server team.

ErikEJ commented 10 months ago

@wfurt On the other hand, all clients on all other platforms work with the current server implementation

awakecoding commented 10 months ago

@wfurt we can certainly try to push the issue to the SQL Server team, but only as an improvement on their side, and something we cannot expect to be fixed in a timely manner.

Also, it works for all platforms other than Android, and MS-TDS documents this finicky behavior as a MUST. I feel like the best we'd get from the SQL Server team might be that they ask the protocol docs team to make the finicky behavior more clearly specified instead of improving the server.

wfurt commented 10 months ago

@wfurt On the other hand, all clients on all other platforms work with the current server implementation

It still may work by accident ;(

I can understand that the query is requested in single message but that should be SQL/TDS layer. Making assumptions about underlying TLS or TCP breaks protocol layering IMHO.

I do think we should address https://github.com/dotnet/runtime/issues/95295. But that may not be quick fix either, likely surfacing in 9.0 unless we can bring good case to servicing committee.

awakecoding commented 10 months ago

@wfurt this isn't the first Microsoft protocol I encounter that's picky like this, this is a prime example of the legacy of SSPI-based security being transposed to TLS: the old SSPI message wrapping is just replaced by TLS records, and someone 20 years ago didn't feel like refactoring the code to handle TLS record fragmentation, so here we are, looking forward to hot-patching the .NET 8 SDK in our builds until this is fixed upstream.

As a matter of fact we had to patch it again to increase the initial buffer size to 8192 as we did hit a special case where the query resulted in packets larger than 4096 bytes: System.Net.Security.dll.zip.

As for fixing this on the SQL Server side, unless you have a direct point of contact within the development team, I really doubt they'll prioritize fixing something that worked by accident for decades. To add on top of this, the simple fact that it's in the protocol specification shows that someone must have hit this very issue in the past, prompting them to document the fragile behavior instead of fixing the implementation.

last-Programmer commented 1 week ago

This seems to be fixed in .net9 rc.