an-tao / trantor

a non-blocking I/O tcp network lib based on c++14/17
Other
390 stars 133 forks source link

OpenSSL TLS provider not working correctly #304

Open tripleslash opened 9 months ago

tripleslash commented 9 months ago

When a HTTPs response with a large content body is sent, then not all bytes will be transferred and the connection will hang indefinitely until it reaches a timeout.

Example request:

curl -X 'GET' -k \ 'https://192.168.200.232/v2/endpoint' \ -H 'accept: application/json' \ -H 'Authorization: Basic dGVzdDo=' --trace-time -v Note: Unnecessary use of -X or --request, GET is already inferred. 15:44:55.711265 Trying 192.168.200.232:443... 15:44:55.717252 Connected to 192.168.200.232 (192.168.200.232) port 443 (#0) 15:44:55.719318 ALPN, offering h2 15:44:55.719346 ALPN, offering http/1.1 15:44:55.719472 TLSv1.0 (OUT), TLS header, Certificate Status (22): 15:44:55.719503 TLSv1.3 (OUT), TLS handshake, Client hello (1): 15:44:55.761049 TLSv1.2 (IN), TLS header, Certificate Status (22): 15:44:55.761226 TLSv1.3 (IN), TLS handshake, Server hello (2): 15:44:55.761775 TLSv1.2 (IN), TLS header, Finished (20): 15:44:55.761849 TLSv1.2 (IN), TLS header, Supplemental data (23): 15:44:55.761886 TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): 15:44:55.761913 TLSv1.2 (IN), TLS header, Supplemental data (23): 15:44:55.761999 TLSv1.3 (IN), TLS handshake, Certificate (11): 15:44:55.763293 TLSv1.2 (IN), TLS header, Supplemental data (23): 15:44:55.763344 TLSv1.3 (IN), TLS handshake, CERT verify (15): 15:44:55.763410 TLSv1.2 (IN), TLS header, Supplemental data (23): 15:44:55.763449 TLSv1.3 (IN), TLS handshake, Finished (20): 15:44:55.763489 TLSv1.2 (OUT), TLS header, Finished (20): 15:44:55.763513 TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): 15:44:55.763558 TLSv1.2 (OUT), TLS header, Supplemental data (23): 15:44:55.763591 TLSv1.3 (OUT), TLS handshake, Finished (20): 15:44:55.763633 SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 15:44:55.763656 ALPN, server did not agree to a protocol 15:44:55.763675 Server certificate: 15:44:55.763690 subject: 15:44:55.763699 start date: Oct 4 08:46:14 2023 GMT 15:44:55.763707 expire date: Sep 10 08:46:14 2123 GMT 15:44:55.763722 issuer: 15:44:55.763752 SSL certificate verify result: self-signed certificate in certificate chain (19), continuing anyway. 15:44:55.763795 TLSv1.2 (OUT), TLS header, Supplemental data (23): 15:44:55.763831 > GET /v2/endpoint HTTP/1.1 15:44:55.763831 > Host: 192.168.200.232 15:44:55.763831 > User-Agent: curl/7.81.0 15:44:55.763831 > accept: application/json 15:44:55.763831 > Authorization: Basic dGVzdDo= 15:44:55.763831 > 15:44:55.776094 TLSv1.2 (IN), TLS header, Supplemental data (23): 15:44:55.776234 TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): 15:44:55.776353 TLSv1.2 (IN), TLS header, Supplemental data (23): 15:44:55.776497 TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): 15:44:55.776742 old SSL session ID is stale, removing 15:44:56.901714 TLSv1.2 (IN), TLS header, Supplemental data (23): 15:44:56.910324 Mark bundle as not supporting multiuse 15:44:56.910420 < HTTP/1.1 200 OK 15:44:56.910559 < content-length: 56531 15:44:56.910639 < content-type: application/json; charset=utf-8 15:44:56.910772 < server: backend-api/0.1.0.6822 15:44:56.910938 < access-control-allow-origin: 15:44:56.911114 < date: Mon, 27 Nov 2023 14:44:52 GMT 15:44:56.911246 < {"json_content_that_will_be_cut_off 15:44:56.912609 * TLSv1.2 (IN), TLS header, Supplemental data (23): ^C ////// <<<< timeout

Is it possible that the last chunk of data is not flushed to the output?

an-tao commented 9 months ago

Which version of trantor or drogon did you use?

tripleslash commented 9 months ago

My trantor lib is built from commit tree https://github.com/an-tao/trantor/tree/26a599857a2fa5d61823ef85012e3d9123d5a675 as of right now and drogon is built from tag 1.9.0 tree.

I saw there were no relevant changes to the OpenSSL handling in either drogon or trantor since those tags

tripleslash commented 9 months ago

If I keep the connection open and don't cancel it I get the following curl error after 30 seconds:

image

tripleslash commented 9 months ago

Is it possible that the last chunk of data is not properly flushed to the TCP socket when the HTTP connection is closed? Maybe there are still some bytes pending and trantor does not wait until it is fully transferred to the client before closing the connection?

an-tao commented 9 months ago

I can't reproduce this error in my environment.

curl -X 'GET' -k https://10.0.0.100:1100/test -v -s -o /dev/null
*   Trying 10.0.0.100:1100...
* Connected to 10.0.0.100 (10.0.0.100) port 1100 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [6 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [1470 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [520 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
*  start date: Feb  9 12:37:34 2023 GMT
*  expire date: Jan 16 12:37:34 2123 GMT
*  issuer: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /test HTTP/1.1
> Host: 10.0.0.100:1100
> User-Agent: curl/7.81.0
> Accept: */*
> 
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 56531
< content-type: text/html; charset=utf-8
< server: drogon/1.9.1
< date: Mon, 27 Nov 2023 15:49:03 GMT
< 
{ [16243 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Connection #0 to host 10.0.0.100 left intact

@marty1885 What do you think might be causing this issue?

tripleslash commented 9 months ago

I had a similar issue once in an implementation of a TCP SSL socket using asio. Turned out I had to call asio::ssl_stream::shutdown before calling the socket close method. Maybe it could be something similar here? Where some bytes remain unsent on the socket until it is gracefully shut down (which would cause a flush)?

tripleslash commented 9 months ago

Also happens with latest trantor (v1.5.15) and drogon (v1.9.1)

marty1885 commented 9 months ago

Me too I can't replicate the issue what so ever. This is what is blocking me from fixing it. :(

@tripleslash Can you share details about the environment you are using? OpenSSL version, compiler, distro, etc.. And if you have Botan3. Can you try Botan 3 and see if the same thing is happening?

tripleslash commented 9 months ago

Custom built Linux BSP with Kernel 6.5.12 and OpenSSL 3.1.0. I cant change my TLS Provider because many other applications on my system depend on OpenSSL Compiler is GCC 12.2.1 armv7

tripleslash commented 9 months ago
20231128 08:24:30.062488 UTC 4347 TRACE [newFileResponse] send http file:/var/www//ui/assets/images/loadingd80ae7c176ae41d8797c.png offset 0 length 0 - HttpResponseImpl.cc:296
20231128 08:24:30.063095 UTC 4347 TRACE [sendStaticFileResponse] Save in cache for 5 seconds - StaticFileRouter.cc:562
20231128 08:24:30.063734 UTC 4347 TRACE [renderToBuffer] reponse(no body):HTTP/1.1 200 OK
content-length: 51581
content-type: image/png
server: backend-api/0.1.0.6824
access-control-allow-origin: *
accept-range: bytes
expires: Thu, 01 Jan 1970 00:00:00 GMT
last-modified: Mon, 27 Nov 2023 15:39:30 GMT
date: Tue, 28 Nov 2023 08:24:30 GMT

 - HttpResponseImpl.cc:704
20231128 08:24:30.069186 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.086815 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.086909 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.093514 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.107780 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.107874 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.124704 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.155241 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.155333 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.181652 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.223903 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:33.879093 UTC 4347 TRACE [operator()] Erase cache - StaticFileRouter.cc:568
20231128 08:24:34.320739 UTC 4347 TRACE [recvData] Received 473 bytes from lower layer - OpenSSLProvider.cc:533
20231128 08:24:34.321107 UTC 4347 TRACE [processApplicationData] Received 451 bytes from SSL - OpenSSLProvider.cc:739

This is what happens when an incomplete file is transferred in the drogon.log with the highest log level. Maybe this can help you find the problem?

an-tao commented 9 months ago

Can you test your application in a normal linux x86 server?

tripleslash commented 9 months ago

TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215 this line always seems to happen when the issue appears

an-tao commented 9 months ago

TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215 this line always seems to happen when the issue appears

You could modify that line to output the error number.

marty1885 commented 9 months ago

That is normal and expected. That sounds like the async socket is full and we are waiting for the OS to make the socket writable again. Hmm.... I agree this could solved by your proposal. But I really need to replicate the bug locally to fix it.

@tripleslash You should be able to switch the TLS provider. Other code can keep using OpenSSL. Trantor will simply use Botan3 for the underlying. Unless you are setting OpenSSL specific configuration commands.

tripleslash commented 9 months ago

Resource temporarily unavailable (errno=11) send file in loop: break on socket buffer full (?) - 11

errno = 11 = EAGAIN

tripleslash commented 9 months ago

That is normal and expected. That sounds like the async socket is full and we are waiting for the OS to make the socket writable again. Hmm.... I agree this could solved by your proposal. But I really need to replicate the bug locally to fix it.

@tripleslash You should be able to switch the TLS provider. Other code can keep using OpenSSL. Trantor will simply use Botan3 for the underlying. Unless you are setting OpenSSL specific configuration commands.

It is quite some effort to add a new BSP build rule to build a completely new TLS lib. Apart from that I really would like to use the same TLS implementation across the entire system

tripleslash commented 9 months ago

The issue seems to happen more frequently when you try to connect to the device from outside the intranet or over SSH proxies for example. In intranet it is hard to reproduce I agree.

Try to establish a connection from some remote network (maybe over your phone via LTE) and see if it happens for you. If not then try an ssh socks5 proxy.

The File also needs to be quite large. 100 KB should be enough.

marty1885 commented 9 months ago

@tripleslash I've looked into the issue but it's baffling me. Is it possible for me to get access to a system that can replicate the issue?

Another weird thing that I noticed. The HTTP/1.1 response should not (and does not, according to the logs) cause the TCP connection to send. Thus not look like something got terminated early. Maybe this has something to do with threading...?

tripleslash commented 9 months ago

@tripleslash I've looked into the issue but it's baffling me. Is it possible for me to get access to a system that can replicate the issue?

Another weird thing that I noticed. The HTTP/1.1 response should not (and does not, according to the logs) cause the TCP connection to send. Thus not look like something got terminated early. Maybe this has something to do with threading...?

What type of access are you looking for? I could forward a https Port but cannot give you SSH access to the device. If thats good enough for you to see the issue in action then you can add me on Discord under xerabtw

marty1885 commented 9 months ago

@tripleslash Sorry getting back to you late. Can you try the latest master? We fixed a bug that may be related to this bug report.

tripleslash commented 9 months ago

@marty1885 This issue still persists with the current master of trantor+drogon

an-tao commented 9 months ago

I think the PR may resolve this problem, @tripleslash would u please check it out?

tripleslash commented 9 months ago

I think the PR may resolve this problem, @tripleslash would u please check it out?

I can confirm that the issue seems to be gone with the refactored trantor API ! @an-tao

tripleslash commented 8 months ago

another update on this: i was able to still reproduce this error but it seems to happen much more rarely now with the trantor refactoring

an-tao commented 8 months ago

@tripleslash Thanks so much for your test on this. I added some debug logs to the destructor of the Tcp Connection Impl class (https://github.com/an-tao/trantor/pull/312/commits/6f854779ecaa96a43a249265045b263328d49e3f). please check it out. when the hanging occurs, disconnect from client, and find those logs on server, let's figure it out.