tripleslash opened this issue 9 months ago
Which version of trantor or drogon did you use?
My trantor lib is built from commit https://github.com/an-tao/trantor/tree/26a599857a2fa5d61823ef85012e3d9123d5a675 as of right now, and drogon is built from the 1.9.0 tag.
I saw no relevant changes to the OpenSSL handling in either drogon or trantor since those tags.
If I keep the connection open and don't cancel it I get the following curl error after 30 seconds:
Is it possible that the last chunk of data is not properly flushed to the TCP socket when the HTTP connection is closed? Maybe there are still some bytes pending, and trantor does not wait until they are fully transferred to the client before closing the connection?
I can't reproduce this error in my environment.
curl -X 'GET' -k https://10.0.0.100:1100/test -v -s -o /dev/null
* Trying 10.0.0.100:1100...
* Connected to 10.0.0.100 (10.0.0.100) port 1100 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [6 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [1470 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [520 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
* subject: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
* start date: Feb 9 12:37:34 2023 GMT
* expire date: Jan 16 12:37:34 2123 GMT
* issuer: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
* SSL certificate verify result: self-signed certificate (18), continuing anyway.
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /test HTTP/1.1
> Host: 10.0.0.100:1100
> User-Agent: curl/7.81.0
> Accept: */*
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 56531
< content-type: text/html; charset=utf-8
< server: drogon/1.9.1
< date: Mon, 27 Nov 2023 15:49:03 GMT
<
{ [16243 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Connection #0 to host 10.0.0.100 left intact
@marty1885 What do you think might be causing this issue?
I had a similar issue once in an implementation of a TCP SSL socket using asio. It turned out I had to call asio::ssl_stream::shutdown before calling the socket close method. Maybe it could be something similar here, where some bytes remain unsent on the socket until it is gracefully shut down (which would cause a flush)?
Also happens with latest trantor (v1.5.15) and drogon (v1.9.1)
I too can't replicate the issue whatsoever. That is what is blocking me from fixing it. :(
@tripleslash Can you share details about the environment you are using? OpenSSL version, compiler, distro, etc. And if you have Botan 3, can you try it and see if the same thing happens?
Custom-built Linux BSP with kernel 6.5.12 and OpenSSL 3.1.0. I can't change my TLS provider because many other applications on my system depend on OpenSSL. Compiler is GCC 12.2.1 (armv7).
20231128 08:24:30.062488 UTC 4347 TRACE [newFileResponse] send http file:/var/www//ui/assets/images/loadingd80ae7c176ae41d8797c.png offset 0 length 0 - HttpResponseImpl.cc:296
20231128 08:24:30.063095 UTC 4347 TRACE [sendStaticFileResponse] Save in cache for 5 seconds - StaticFileRouter.cc:562
20231128 08:24:30.063734 UTC 4347 TRACE [renderToBuffer] reponse(no body):HTTP/1.1 200 OK
content-length: 51581
content-type: image/png
server: backend-api/0.1.0.6824
access-control-allow-origin: *
accept-range: bytes
expires: Thu, 01 Jan 1970 00:00:00 GMT
last-modified: Mon, 27 Nov 2023 15:39:30 GMT
date: Tue, 28 Nov 2023 08:24:30 GMT
- HttpResponseImpl.cc:704
20231128 08:24:30.069186 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.086815 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.086909 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.093514 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.107780 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.107874 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.124704 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.155241 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.155333 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.181652 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.223903 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:33.879093 UTC 4347 TRACE [operator()] Erase cache - StaticFileRouter.cc:568
20231128 08:24:34.320739 UTC 4347 TRACE [recvData] Received 473 bytes from lower layer - OpenSSLProvider.cc:533
20231128 08:24:34.321107 UTC 4347 TRACE [processApplicationData] Received 451 bytes from SSL - OpenSSLProvider.cc:739
This is what appears in drogon.log at the highest log level when an incomplete file is transferred. Maybe this can help you find the problem?
Can you test your application on a normal Linux x86 server?
TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
this line always seems to happen when the issue appears
You could modify that line to output the error number.
That is normal and expected. That sounds like the async socket is full and we are waiting for the OS to make the socket writable again. Hmm... I agree this could be solved by your proposal. But I really need to replicate the bug locally to fix it.
@tripleslash You should be able to switch the TLS provider. Other code can keep using OpenSSL; trantor will simply use Botan 3 for the underlying TLS, unless you are setting OpenSSL-specific configuration commands.
Resource temporarily unavailable (errno=11) send file in loop: break on socket buffer full (?) - 11
errno = 11 = EAGAIN
It is quite some effort to add a new BSP build rule to build a completely new TLS lib. Apart from that, I really would like to use the same TLS implementation across the entire system.
The issue seems to happen more frequently when you try to connect to the device from outside the intranet, or over SSH proxies for example. On the intranet it is hard to reproduce, I agree.
Try to establish a connection from some remote network (maybe over your phone via LTE) and see if it happens for you. If not, then try an SSH SOCKS5 proxy.
The file also needs to be quite large; 100 KB should be enough.
@tripleslash I've looked into the issue but it's baffling me. Is it possible for me to get access to a system that can replicate the issue?
Another weird thing that I noticed: the HTTP/1.1 response should not (and, according to the logs, does not) cause the TCP connection to close. Thus it does not look like something got terminated early. Maybe this has something to do with threading...?
What type of access are you looking for? I could forward an HTTPS port but cannot give you SSH access to the device. If that's good enough for you to see the issue in action, then you can add me on Discord under xerabtw.
@tripleslash Sorry for getting back to you late. Can you try the latest master? We fixed a bug that may be related to this bug report.
@marty1885 This issue still persists with the current master of trantor+drogon
I think the PR may resolve this problem. @tripleslash, would you please check it out?
I can confirm that the issue seems to be gone with the refactored trantor API! @an-tao
Another update on this: I was able to still reproduce this error, but it seems to happen much more rarely now with the trantor refactoring.
@tripleslash Thanks so much for your test on this. I added some debug logs to the destructor of the TcpConnectionImpl class (https://github.com/an-tao/trantor/pull/312/commits/6f854779ecaa96a43a249265045b263328d49e3f). Please check it out. When the hanging occurs, disconnect from the client and find those logs on the server; let's figure it out.
When an HTTPS response with a large content body is sent, not all bytes are transferred and the connection hangs indefinitely until it reaches a timeout.
Example request:
Is it possible that the last chunk of data is not flushed to the output?