Open HamzaHajeir opened 1 year ago
I've noticed that some times the preceeding handshake's error code is -17168 (-0x4310), which I've tried to find the represented meaning, but failed to
Use programs/util/strerror
to see the error message corresponding to a numerical value. The values don't depend on the architecture or on the compile-time configuration, so you can build a copy of the library in the default configuration on your development machine for this program.
mbedtls-strerror -0x4310
Last error was: -0x4310 - RSA - The private key operation failed : BIGNUM - Memory allocation failed
So something is leaking memory.
The memory leak is probably the cause of the signature verification failure, not a consequence. Although I am a bit surprised that the error is consistently MBEDTLS_ERR_RSA_VERIFY_FAILED
: looking at our RSA code, it never returns this exact error for out-of-memory. So I suspect that there's an uncaught malloc failure somewhere else which leads to an invalid key object or invalid buffer content at some point.
Use
programs/util/strerror
to see the error message corresponding to a numerical value. The values don't depend on the architecture or on the compile-time configuration, so you can build a copy of the library in the default configuration on your development machine for this program.mbedtls-strerror -0x4310 Last error was: -0x4310 - RSA - The private key operation failed : BIGNUM - Memory allocation failed
I've missed it when writing, I was looking for the bit-mask corresponding codes and it's right, 0x0010 corresponds to a failed allocation.
So something is leaking memory.
The memory leak is probably the cause of the signature verification failure, not a consequence. Although I am a bit surprised that the error is consistently
MBEDTLS_ERR_RSA_VERIFY_FAILED
: looking at our RSA code, it never returns this exact error for out-of-memory. So I suspect that there's an uncaught malloc failure somewhere else which leads to an invalid key object or invalid buffer content at some point.
But the heap memory is available here, except it's an allocation within a pool managed by mbedtls, what do you think? As my calculation over 48 failed handshake, it was ~165 Bytes leaked per failed connection on average.
I could narrow down the issue to the state which the error being return:
Debug level 1.
00:12:25.015 > --> mbedtls_ssl_handshake_step state 4 heap 57868 -->
00:12:25.803 > IDF/components/mbedtls/mbedtls/library/ssl_srv.c:3542: mbedtls_pk_sign() returned -17280 (-0x4380)
00:12:25.805 > <-- mbedtls_ssl_handshake_step state 4 ret -17280 heap 44964 <--
This corresponds to code:
bool mbedtls_is_ssl_handshake_over = false;
int ret;
do {
LWIP_DEBUGF(ALTCP_MBEDTLS_DEBUG, ("--> mbedtls_ssl_handshake_step state %d heap %u -->\n", state->ssl_context.state, get_heap()));
ret = mbedtls_ssl_handshake_step(&state->ssl_context);
mbedtls_is_ssl_handshake_over = (state->ssl_context.state == MBEDTLS_SSL_HANDSHAKE_OVER);
LWIP_DEBUGF(ALTCP_MBEDTLS_DEBUG, ("<-- mbedtls_ssl_handshake_step state %d ret %d heap %u <--\n", state->ssl_context.state, ret, get_heap()));
}
while (!mbedtls_is_ssl_handshake_over && !ret);
As a continuation of debug, an attached log of several https requests happen, where SSL session is also managed.
There happens memory leak for every new HTTPS connection handshake.
device-monitor-230629-161033 copy 2.log
A manual memory leak calculation per handshake suggests multiple probable bytes per handshake (ranges from 203B to 299B).
No memory leak encountered by a reconnect handshake as a client.
Can you use tools like Asan to identify the memory leak precisely? If not, can you at least find the exact amount that's leaked, and look for a structure that has that size?
We run a lot of tests with Asan and they validate that the library doesn't have memory leaks. But of course that only validates that there are no memory leaks for the settings that we test: there can be a leak in a different configuration or scenario.
I'll check that out soon, meanwhile can you mention which structures I should check against?
On Thu, Jun 29, 2023, 17:54 Gilles Peskine @.***> wrote:
Can you use tools like Asan https://github.com/google/sanitizers/wiki/AddressSanitizer to identify the memory leak precisely? If not, can you at least find the exact amount that's leaked, and look for a structure that has that size?
We run a lot of tests with Asan and they validate that the library doesn't have memory leaks. But of course that only validates that there are no memory leaks for the settings that we test: there can be a leak in a different configuration or scenario.
— Reply to this email directly, view it on GitHub https://github.com/Mbed-TLS/mbedtls/issues/7833#issuecomment-1613326900, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH3O7J4XGHCCMPOEESXLZHTXNWJKXANCNFSM6AAAAAAZTGBEBY . You are receiving this because you authored the thread.Message ID: @.***>
Probably one of the SSL structures (mbedtls_ssl_context
, mbedtls_ssl_session
or mbedtls_ssl_handshake
— their size heavily depends on the configuration), or the RSA key (mbedtls_rsa_context
) but that's less than 200B (or far more if it's the RSA context and the embedded MPI objects). But it could be something else, like a certificate mbedtls_x509_crt
.
Given the size of SSL structures in my setup:
mbedtls_ssl_session size = 128
mbedtls_ssl_context size = 552
mbedtls_ssl_config size = 232
mbedtls_ssl_transform size = 212
mbedtls_ssl_handshake_params size = 2264
mbedtls_ssl_sig_hash_set_t size = 8
mbedtls_ssl_key_cert size = 12
mbedtls_ssl_flight_item size = 16
And given that I've identified the exact amount of memory leaked per handshake, which is 164 Bytes, which is the sum of the following structures:
Given that I'm not initiating any session for server requests, should my code manage them even so? and how further?
** My previous calculation was based on Web refresh with the browser, which initiates more than a connection simultaneously as needed by. Last calculation is based on CURL command line which initiates exactly one connection against the server.
Given that I'm not initiating any session for server requests, should my code manage them even so?
I think your code should only reference mbedtls_ssl_session
in a server that does session resumption (with tickets or a cache). Otherwise the session structure is just internal to the mbedtls_ssl_context
and should be freed when you free or reset the context.
In the current setup I've the tickets and cache enabled and managed, but it's shared across requests (at the server level): https://github.com/HamzaHajeir/esp-lwip/blob/2.1.2-esp-patch/src/apps/altcp_tls/altcp_tls_mbedtls.c#L115-L121
Initialization: https://github.com/HamzaHajeir/esp-lwip/blob/2.1.2-esp-patch/src/apps/altcp_tls/altcp_tls_mbedtls.c#L848-L868
Freeing is done when the listening connection (in LwIP it's LISTEN Protocol Control Block - PCB) gets closed, only: https://github.com/HamzaHajeir/esp-lwip/blob/2.1.2-esp-patch/src/apps/altcp_tls/altcp_tls_mbedtls.c#L1067-L1072
Should these be managed on request-level .. not server-level?
It's insured that the mbedtls_ssl_context
is freed upon any connection close, see:
https://github.com/HamzaHajeir/esp-lwip/blob/2.1.2-esp-patch/src/apps/altcp_tls/altcp_tls_mbedtls.c#L1357
What do you think?
I had a quick look at your code and I don't see anything obviously wrong. But I'm not familiar at all with LWIP, it might be worth asking people who are familiar with it.
One thing to pay attention is, if there has ever been an error on the connection, then you must close it: attempting to use the connection may put the SSL object in a bad state. But the error handling for the handshake looks fine to me.
I've just noticed missing updates to esp-lwip repo meeting latest changes, occured when I was preparing lwip patches.
One thing to pay attention is, if there has ever been an error on the connection, then you must close it: attempting to use the connection may put the SSL object in a bad state. But the error handling for the handshake looks fine to me.
Correct, the TCP connection reset is well managed there, but we might further check these lines which are TLS protocol related. Although logs across history show only one occurance of connection was closed gracefully
.
Also what I've suspected, but couldn't confirm due to not-enough experience in C, is the memory initialization of configuration struct which seems out of order compared to the structure sequence. But however, this is done per server setup, not every connection.
Another note: In a previous manual calculation of memory leak per handshake (browser refreshes "some with full refresh") resulted an average leak of 165B across 48 refresh instances. Which seems very close to the observed 164B.
So this might relate to a full handshake process; wherein the browser initiates more than a connection, and when one has succeeded it aborts the other, shortening the probability of leak to be in advanced states (MBEDTLS_SSL_CLIENT_CERTIFICATE
to MBEDTLS_SSL_HANDSHAKE_WRAPUP
) if it was true.
What do you think?
seems out of order compared to the structure sequence
That doesn't matter.
the browser initiates more than a connection, and when one has succeeded it aborts the other
This is a promising lead. We have a lot of negative tests where the client sends an invalid message, but I'm not sure we have tests where the client just closes the connection. However, I do see a call to mbedtls_ssl_free
in altcp_mbedtls_dealloc
, and that does free all the session data (it's not missable). So as long as this function is called, there should be no leak of mbedtls data. And if the struct altcp_pcb
isn't leaking then presumably altcp_mbedtls_dealloc
is getting called.
I've just confirmed that altcp_close(conn) is being called to a given connection, also old logs that enabled LwIP debug showed altcp_mbedtls_dealloc
is being called. (debug line in a local build).
I've further made another test wherein I've turned it to ALTCP_TCP rather ALTCP_TLS, to check whether it's an application/TCP leak or not, it initially showed a leak, but the memory did recover after a while (I think by LwIP internal operation in freeing closed PCBs).
Suspecting my code which is a library I maintain, I've failed in finding any source of memory leak, really (One can trace H4AT_TLS
macro for TLS-related code).
Summary
System information
Mbed TLS version (number or commit id): 2.28.3 Operating system and version: Built with Ubuntu Configuration (if not default, please attach
mbedtls_config.h
): Please find attached config.txtCompiler and options (if you used a pre-built binary, please indicate how you obtained it): With default esp-idf compilation environment: version
release/v4.4
Additional environment information: Built using a fork of espressif/esp32-arduino-lib-builder
Expected behavior
I host TLS webserver using LwIP ALTCP TLS - MBEDTLS port, I expect every client (browser) to be able to connect anytime.
In essence, It works correctly to some point, correct behaviour includes making successful handshakes and communication. The problem starts after some amount of time/process, in which causes to deny any new connections, even from browsers were already connected by the means of a reload button.
Already connected connections prior to this point continue to communicate successfully afterwards.
The error code returned by the
mbedtls_ssl_handshake
is fixed at -17280 (-0x4380) (MBEDTLS_ERR_RSA_VERIFY_FAILED).I've noticed that some times the preceeding handshake's error code is -17168 (-0x4310), which I've tried to find the represented meaning, but failed to, nearest assumption is
MBEDTLS_ERR_RSA_PRIVATE_FAILED | MBEDTLS_ERR_MPI_NEGATIVE_VALUE
.Is there any hint for what's going wrong? With thanks,
Actual behavior
Log shows:
Or in the other described case:
Steps to reproduce
It's hard to tell how to reproduce, as the time is uncertain, and upon that any further reset of the webserver (including SSL context) yield in yet again successful handshakes until it reappears.
However, here's the running ALTCP MBEDTLS Code, pointing to the handshake API call:
https://github.com/HamzaHajeir/esp-lwip/blob/2.1.2-esp-patch/src/apps/altcp_tls/altcp_tls_mbedtls.c#L278
Additional information
Meanwhile this issue, the MCU is able to reconnect to a MQTT server by TLS with successful (one-way auth) handshakes,
My investigation of a very long log shows that this problem leaks memory too.
Debug Logs.