tubular-rheostat opened 5 months ago
I tried testing with master. I see PR https://github.com/cesanta/mongoose/pull/2646 is related, but that change does not address this issue.
Can you please check if our built-in TLS stack solves your issue? (Build with MG_TLS_BUILTIN instead.)
Initial test with latest master e4cd6bd8 and MG_TLS_BUILTIN wouldn't connect; the TLS handshake never completed.
I traced it to this code in mg_tls_init:
if (c->is_client) {
tls->server_cert_der.ptr = NULL;
return;
}
After that returns, no TLS Client Hello was sent; eventually the broker disconnected due to its timeout.
I'll have some time to tinker with it this evening (US Central timezone)
It requires EC certificates and keys; the API is the same.
I am able to reproduce the issue with MG_TLS_BUILTIN, and the patch still resolves it. However, my theory of the cause is in doubt: I amended my patch to add logging to indicate when a 2nd iteration produced additional data, and my test message never triggered it. However, if I remove my patch then my program stops working again.
Other messages earlier in the MQTT handshake did require multiple iterations through this loop, as many as 5.
Here is the same patch, with logging
diff --git a/mongoose.c b/mongoose.c
index 62fde56a..48080ee8 100644
--- a/mongoose.c
+++ b/mongoose.c
@@ -5533,13 +5533,23 @@ static void read_conn(struct mg_connection *c, struct pkt *pkt) {
mg_error(c, "oom");
} else {
// Decrypt data directly into c->recv
- long n = mg_tls_recv(c, &io->buf[io->len], io->size - io->len);
- if (n == MG_IO_ERR) {
- mg_error(c, "TLS recv error");
- } else if (n > 0) {
- // Decrypted successfully - trigger MG_EV_READ
- io->len += (size_t) n;
- mg_call(c, MG_EV_READ, &n);
+ int ct = 0;
+ while(true) {
+ ++ct;
+ long n = mg_tls_recv(c, &io->buf[io->len], io->size - io->len);
+ if (n == MG_IO_ERR) {
+ mg_error(c, "TLS recv error");
+ break;
+ } else if (n > 0) {
+ // Decrypted successfully - trigger MG_EV_READ
+ io->len += (size_t) n;
+ mg_call(c, MG_EV_READ, &n);
+ if(ct > 1) {
+ MG_DEBUG(("TLS recv loop %d", ct));
+ }
+ } else {
+ break;
+ }
}
}
I attach logs of the transaction with and without my patch
And a pcap just for good measure
192.168.192.115 is the Mongoose MQTT server under test; 192.168.192.130 is a mosquitto broker, connecting to .115 as a bridge. The interesting packet with 5 application data records is packet 20. My test message is packet 22.
Note that the round-trip time between those two is around 22 ms. I suspect that you need a certain amount of latency in order to repro the issue; otherwise the TCP packets would not accumulate so many TLS records. This could be verified by testing again with NODELAY turned on.
Anyway, this is enough for me today.
@cpq This seems to be yet another corner case. We get more than one TLS record, we put data in its buffer, we call TLS, we decrypt the first record. There is still outstanding data in the buffer, but since a whole record has been processed, tls_pending() won't help here. We need to react to outstanding data in the TLS buffer in order for tcpip_poll() to call tls_recv(). I think a similar problem will occur with sockets. As long as no data comes in, we stall with outstanding TLS records in their buffer. If new data comes in, we process the first outstanding record and the others stay there.
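The corner case above can be modeled in a few lines. The following is a toy sketch (toy_tls and toy_recv_one_record are hypothetical names, not Mongoose API), showing how one decrypt pass per received packet leaves a second record stranded in the buffer until more data arrives:

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the stall (hypothetical, not Mongoose code): raw TCP bytes
 * accumulate in a buffer and may contain several TLS "records", here
 * length-prefixed by one byte each. */
struct toy_tls {
  unsigned char raw[64]; /* encrypted bytes received from TCP */
  size_t raw_len;        /* outstanding, not-yet-decrypted bytes */
};

/* One decryption pass, like a single mg_tls_recv() call: it consumes
 * exactly one record and leaves any further records in the buffer. */
static long toy_recv_one_record(struct toy_tls *t, unsigned char *out,
                                size_t cap) {
  size_t n;
  if (t->raw_len == 0) return -1;               /* nothing buffered (WANT_READ) */
  n = t->raw[0];                                /* record payload length */
  if (n + 1 > t->raw_len || n > cap) return -1; /* incomplete record / no room */
  memcpy(out, &t->raw[1], n);
  memmove(t->raw, &t->raw[n + 1], t->raw_len - (n + 1));
  t->raw_len -= n + 1;
  return (long) n;
}
```

After one call against a buffer holding two records, raw_len is still nonzero; unless something re-triggers the read path, that leftover record sits there indefinitely, which matches the stall described above.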
@tubular-rheostat Thank you, Brian, you helped a lot.
@tubular-rheostat thanks!
Loops in receive code are dangerous. Brian, could you try another patch, please?
Try to change this line https://github.com/cesanta/mongoose/blob/7581229bbfd31ea739ff3a7b9a8a230458063c82/src/tls_mbed.c#L171
To this:
return tls == NULL ? 0 : c->rtls.len + mbedtls_ssl_get_bytes_avail(&tls->ssl);
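As I read the intent of that change (a hedged sketch; the struct fields below are simplified stand-ins, not the real layout): the pending count should include both the raw bytes still queued in c->rtls and whatever the TLS library already buffered internally (what mbedtls_ssl_get_bytes_avail() reports), so the poller keeps treating the connection as readable:

```c
#include <stddef.h>

/* Simplified stand-ins for the real fields (hypothetical, for illustration):
 * rtls_len  - raw TLS bytes queued but not yet fed through decryption
 * ssl_avail - bytes the TLS library has buffered internally */
struct toy_tls_conn {
  int has_tls;
  size_t rtls_len;
  size_t ssl_avail;
};

/* Mirror of the suggested change: count both sources of pending data. */
static size_t toy_tls_pending(const struct toy_tls_conn *c) {
  return c->has_tls ? c->rtls_len + c->ssl_avail : 0;
}
```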
That change didn't help. I'm using Mongoose built-in TCP; the only code I saw referring to that edited function was in sock.c.
I can try running my test code with BSD sockets but I probably won't have time to work on it until the weekend.
If it helps here is some of my test program:
https://gist.github.com/tubular-rheostat/1cdd37e82aef3e2117fc5d377e060b1b
Brian, as can be seen in the link, that change is in src/tls_mbed.c, line 171; that is the MbedTLS code. Once you change that line, run make mongoose.c and you'll get that change transferred into mongoose.c.
You can also directly change mongoose.c line 10906:
https://github.com/cesanta/mongoose/blob/7581229bbfd31ea739ff3a7b9a8a230458063c82/mongoose.c#L10904-L10907
Please build and test with MbedTLS, this applies to both sockets and built-in TCP stack.
That is what I did. I copied the mongoose.c from 7581229b and applied your most recent patch. I tested with MbedTLS and built-in TCP.
When I said the function is not used outside of sock.c, that statement was based on this search (attached image). I don't think that built-in TCP ever calls mg_tls_pending; only the BSD socket code calls it. Sorry if I wasn't clear before; I was interrupted before I could finish writing.
I see something similar on Windows, with -DMG_TLS=MG_TLS_MBED=1 and set V=3 & unit_test.exe:
tls_mbed.c:114:mg_tls_init 2 Setting TLS
tls_mbed.c:142:mg_tls_init 2 hostname verification: localhost
tls_mbed.c:95:mg_tls_handshak TLS handshake: -0x6c00
sock.c:440:accept_conn 3 616 accepted 127.0.0.1:62645 -> 127.0.0.1:12347
tls_mbed.c:114:mg_tls_init 3 Setting TLS
net.c:150:mg_close_conn 2 608 closed
And -0x6c00 == "Internal error (eg, unexpected failure in lower-level module)", i.e. MBEDTLS_ERR_SSL_INTERNAL_ERROR, which is thrown in a gazillion places.
With -DMG_TLS=MG_TLS_OPENSSL I get: SUCCESS. Total tests: 1290
With -DMG_TLS=MG_TLS_BUILTIN I get: SUCCESS. Total tests: 1277
Seems something has changed in MbedTLS recently. I built it from git master today. Similar handshake issue in libcurl :-(
@gvanem Mmmmm.... I think that is another issue; this one is not MbedTLS-specific but our problem. We are not running unit tests for MbedTLS nor OpenSSL in Windows; the examples that do, clone and build a specific version of MbedTLS, as explained in the TLS tutorial: "Only what the example provides in its Makefile is supported".
@tubular-rheostat Oops, you are right
@cpq I think the fix is cool. The elephant in the room is that our built-in TCP/IP stack's mg_mgr_poll() won't poll TLS for outstanding received data to be processed; that only happens when read_conn() is called and there is TCP data, and that one is called when a TCP-holding frame is received, from within rx_tcp(). We do poll for outstanding data to be sent, by calling write_conn(), though.
IMHO our MIP poll loop needs a tweak. The socket select/poll/epoll core does if (mg_tls_pending(c) > 0) ms = 1, c->is_readable = 1; and that, IIUC, triggers read_conn(), which ends up calling mg_tls_recv() with outstanding data in the buffer (plus any possible new data).
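That sockets-core behavior can be sketched as a toy poll loop (all names hypothetical, not Mongoose internals): if TLS data is pending, mark the connection readable so the read path runs again even when no new TCP data arrived, which is exactly the re-trigger the MIP loop lacks:

```c
#include <stddef.h>

/* Toy event loop (hypothetical names) mimicking the sockets-core check:
 * pending TLS data forces the read path to run without new TCP data. */
struct toy_conn {
  size_t tls_pending; /* undecrypted/outstanding TLS bytes */
  size_t consumed;    /* bytes delivered to the application */
  int is_readable;
};

static void toy_read_conn(struct toy_conn *c) {
  /* pretend each read pass decrypts at most one 5-byte record */
  size_t n = c->tls_pending < 5 ? c->tls_pending : 5;
  c->tls_pending -= n;
  c->consumed += n;
}

static void toy_poll(struct toy_conn *c) {
  if (c->tls_pending > 0) c->is_readable = 1; /* the key check */
  if (c->is_readable) toy_read_conn(c);
  c->is_readable = 0;
}
```

Repeated polls then drain the backlog one record at a time, with no further network input needed.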
Should we do something like this for MIP ?
diff --git a/src/net_builtin.c b/src/net_builtin.c
index 98a58e75..3390d16d 100644
--- a/src/net_builtin.c
+++ b/src/net_builtin.c
@@ -613,6 +613,17 @@ long mg_io_send(struct mg_connection *c, const void *buf, size_t len) {
return (long) len;
}
+static void handle_tls_recv(struct mg_connection *c, struct mg_iobuf *io) {
+  long n = mg_tls_recv(c, &io->buf[io->len], io->size - io->len);
+  if (n == MG_IO_ERR) {
+    mg_error(c, "TLS recv error");
+  } else if (n > 0) {
+    // Decrypted successfully - trigger MG_EV_READ
+    io->len += (size_t) n;
+    mg_call(c, MG_EV_READ, &n);
+  }
+}
+
static void read_conn(struct mg_connection *c, struct pkt *pkt) {
struct connstate *s = (struct connstate *) (c + 1);
struct mg_iobuf *io = c->is_tls ? &c->rtls : &c->recv;
@@ -691,14 +702,7 @@ static void read_conn(struct mg_connection *c, struct pkt *pkt) {
mg_error(c, "oom");
} else {
// Decrypt data directly into c->recv
- long n = mg_tls_recv(c, &io->buf[io->len], io->size - io->len);
- if (n == MG_IO_ERR) {
- mg_error(c, "TLS recv error");
- } else if (n > 0) {
- // Decrypted successfully - trigger MG_EV_READ
- io->len += (size_t) n;
- mg_call(c, MG_EV_READ, &n);
- }
+      handle_tls_recv(c, io);
}
} else {
// Plain text connection, data is already in c->recv, trigger
@@ -1109,6 +1113,8 @@ void mg_mgr_poll(struct mg_mgr *mgr, int ms) {
MG_VERBOSE(("%lu .. %c%c%c%c%c", c->id, c->is_tls ? 'T' : 't',
c->is_connecting ? 'C' : 'c', c->is_tls_hs ? 'H' : 'h',
c->is_resolving ? 'R' : 'r', c->is_closing ? 'C' : 'c'));
+    if (c->is_tls && mg_tls_pending(c) > 0)
+      handle_tls_recv(c, &c->rtls);
if (can_write(c)) write_conn(c);
if (c->is_draining && c->send.len == 0 && s->ttype != MIP_TTYPE_FIN)
init_closure(c);
WDYT?
I will try that last patch on my test env., probably in the evening (US Central time).
I tried testing with BSD sockets + MbedTLS but could never repro the issue (with unmodified 7.13). For that I had to switch from testing on an rp2040 to testing on a Mac, so the test environment is not the same. I will try with Mongoose built-in TCP on the Mac.
For testing on the Mac I also tried simulating latency by adding a delay to the MQTT broker machine's network interface using the Linux tc command, on the assumption that latency helps repro the issue, but it had no effect. I think I probably set this up wrong; I'll experiment.
Yeah, this is really hard to reproduce. My suggestion would work on our built-in TCP/IP stack but is just an idea, we haven't actually tried it. If you can, that would be a nice input.
I tried the most recent patch (modified slightly to get it to compile) against master 7581229b and it wouldn't connect to the MQTT server. I think this happens during TLS init: it opened the TCP connection but closed it again before sending the Client Hello record.
82c ERROR mongoose.c:1363:mg_error 1 -1 TLS recv error
I think the underlying MbedTLS error code was MBEDTLS_ERR_SSL_FEATURE_UNAVAILABLE.
I was not able to work out what is going wrong.
I was able to repro the issue on my laptop, see https://github.com/tubular-rheostat/mongoose-test-2668 for a test program.
Hopefully this makes further development work and testing easier. Apologies if you don't have a Mac to test on...
I tried retesting with @scaprile's latest patch and didn't have the problem with the TLS handshake again; it was able to connect. I don't know what I did that was different. It did not resolve my issue though. I don't think the new if in mg_mgr_poll was ever taken (it was always false).
I was experimenting with an alternate fix: just have mg_tls_recv process as many records as it can, within the available buffer size.
long mg_tls_recv(struct mg_connection *c, void *buf, size_t len) {
  struct mg_tls *tls = (struct mg_tls *) c->tls;
  size_t total_read = 0;
  while (total_read < len) {
    int n = mbedtls_ssl_read(&tls->ssl, (unsigned char *) buf + total_read,
                             len - total_read);
    if (n > 0) {
      total_read += (size_t) n;
    } else if (n == MBEDTLS_ERR_SSL_WANT_READ || n == MBEDTLS_ERR_SSL_WANT_WRITE) {
      return total_read == 0 ? MG_IO_WAIT : (long) total_read;
    } else {
      return total_read == 0 ? MG_IO_ERR : (long) total_read;
    }
  }
  // oops, we filled the entire buffer; unread data may still remain!
  return (long) total_read;
}
This code has some issues.
If you refer to the lwIP port of MbedTLS [github.com/lwip-tcpip], they also have a loop that continues reading until the read call returns an error code.
I was reading about OpenSSL and it implies similar limitations [openssl.org]: reading multiple records requires multiple calls to SSL_read ("At most the contents of one record will be returned"). I don't have any experience with OpenSSL.
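The one-record-per-call semantics described above are usually handled with a drain loop. Here is a hedged sketch over a hypothetical record source (recv_one stands in for a single SSL_read/mbedtls_ssl_read call and returns -1 for WANT_READ), accumulating until the source is exhausted or the output budget is spent:

```c
#include <stddef.h>

/* Hypothetical one-record-per-call source: each call yields the size of
 * the next decrypted record, or -1 when nothing is buffered (WANT_READ). */
struct rec_src {
  const long *sizes;
  size_t count, next;
};

static long recv_one(struct rec_src *s) {
  return s->next < s->count ? s->sizes[s->next++] : -1;
}

/* Drain loop: keep reading records until WANT_READ or the buffer budget
 * is used up, returning the total number of bytes produced. */
static long drain(struct rec_src *s, size_t budget) {
  long total = 0;
  while ((size_t) total < budget) {
    long n = recv_one(s);
    if (n < 0) break; /* fully drained */
    total += n;
  }
  return total;
}
```

The trade-off mirrors the discussion in this thread: the loop empties the backlog in one pass, but it runs inside the receive path, which is exactly the kind of inner loop the maintainers want to avoid.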
Anyway, thanks for looking at the issue. I'm happy with the patch from my initial comment and will continue working with that.
Let me know if you want me to try anything else in my test environment.
Brian, we already have a loop in place at the outer level; we will avoid inner loops as much as possible. cpq is busy with Embedded World now; he is the man, and he also has a Mac. We will fix this issue as we always do; it may take a bit longer. My patch was a shot in the dark, as I don't see the problem; I just mimicked our behavior in other environments, I may have left something important out, and of course could not test anything. I may try to reproduce what you posted on my Linux box. Anyway, we'll let you know when we have something.
@tubular-rheostat @scaprile thanks!
I used Brian's test program, just modified it to use pcap instead of tun, to avoid sudo-ing and ifconfig-ing all the time. Then I changed it to use broker.hivemq.com:8883 instead of the local mosquitto, because the local mosquitto did not start for whatever reason.
I applied Sergio's patch and it works OK (using ca.pem file from test/data):
f4668819 DEBUG mongoose.c:4775:mg_connect 3 -1 mqtts://broker.hivemq.com:8883
f466881c DEBUG mongoose.c:4749:mg_close_conn 1 -1 closed
f4668842 DEBUG mongoose.c:1230:dns_cb 3 broker.hivemq.com is 3.73.141.152
f4668843 DEBUG mongoose.c:5914:mg_connect_resolved 3 192.168.2.19:34433 -> 3.73.141.152:8883
f4668886 DEBUG main.cpp:309:mqtt_client_fn CA: 30427
f4668886 DEBUG mongoose.c:10848:mg_tls_init 3 Setting TLS
f4668ca1 DEBUG mongoose.c:10821:mg_tls_handshake 3 success
connected to mqtt
f4668ce1 DEBUG mongoose.c:4503:mqtt_cb 3 Connected
test_req message: {"topic":"reply/202376DE","correlation_id":"890954FD-BB39-45B3-8398-4CA3202376DE"}
f466f9e2 DEBUG mongoose.c:4383:mg_mqtt_pub 3 [reply/202376DE] -> [{"timestamp":4100389346,"conn_id":2,"correlation_id":"890954FD-BB39-45B3-8398-4CA3202376DE"}]
The suggested patch looks good to me. @scaprile do you want to turn it into a PR?
Thanks @cpq for looking at the issue again.
I see some improvement but can still repro the issue. With master 9cccb983 including this PR, using my test program (modified to address a renamed field in mg_str), I can get the mqtt_req_resp script in my test repo to work one time, but if I run the script a 2nd time it still hangs waiting for a reply.
I didn't know about pcap; I'll check out the example.
@tubular-rheostat thanks. Do we use the same connection when the script is executed the 2nd time? In other words, what's different the 2nd time? And does it fail with built-in TLS?
The same connection is used the 2nd time. From the steps in my demonstration code I just repeated step 11.
@tubular-rheostat did you try the pcap example? It'd be nice to be able to reproduce it locally.
Environment
If more than one TLS record is in the same TCP packet, Mongoose will process only the first record, leaving the second in the buffer until another packet is received. A possible solution is to call mbedtls_ssl_read in a loop until it returns MBEDTLS_ERR_SSL_WANT_READ or a similar value when there is no more data.
[Image from WireShark showing two TLS records in one TCP packet]
My description that follows refers to MQTT but that is coincidental.
My test program is an MQTT client; it subscribes to $SYS/broker/clients/connected and test_req. That command connects to the broker and sends a message. Because I am subscribed to the broker's client count topic, this often causes two messages to be published: the updated client count as well as the message on test_req.
I applied this patch to 7.13 and it resolved the issue: