Closed Quinnvanderschaar2 closed 1 year ago
Hi @Quinnvanderschaar2 I assume you're using SAS keys for auth?
Yes
I have this repro'd. Let me look into this.
Hey @Quinnvanderschaar2 If you have a moment, can you try this branch here and see if it helps? This will let the device attempt a reconnect on process loop failure. In MQTT 3, a token expiration requires a disconnect and reconnect, which this does. Also to note for you and others, the extra ~5 minutes above the default 60 minutes is to account for potential clock drift.
https://github.com/Azure-Samples/iot-middleware-freertos-samples/tree/dane/reconnect
Hi @danewalton , I have tested this before. It works for like 7 hours, after that there is now heap space available anymore. Every time a new connection is opened the heap increases. I think that there are memory leaks in the code. Best regards, Quinn.
I'm looking into running our Linux sample in valgrind (a little complicated since it needs libpcap and sudo permissions), but I'm not sure that's happening. In the mean time, I think more likely might be memory fragmentation. You can try running the sample with mbedtls static mode turned on if you'd like: https://github.com/Mbed-TLS/mbedtls-docs/blob/main/kb/how-to/using-static-memory-instead-of-the-heap.md
Hi @Quinnvanderschaar2
I linked the above PR which will close this issue, as it resolves the original problem. The memory issue is a different, and as mentioned, I would suggest the static memory capability of mbedTLS to help with possible issues with malloc()
.
Hi @danewalton , I have not yet managed to reproduce a solution with the static memory capability of mbedTLS. I have tried to change the config file but this resulted without any result. Do you have a config file that I could use? Best regards, Quinn.
@danewalton
Hi @danewalton, @ewertons ,
This problem is not solved in the latest version that is available on GitHub. Currently we are heading to a release, so we would appreciate that there will be a solution available for the memory fragmentation/leaks.
Best regards,
Quinn van der Schaar.
@Quinnvanderschaar2 My recommendation for now would be to reboot the device instead of breaking out of the sample loop (as in the branch Dane sent previously) as everything doesn't get completely re-initialized within the demo loop. I'm also having issues running valgrind to debug potential memory issues, and unfortunately don't have a config file for mbedTLS to try out the static memory capabilities.
We are still working on the work item that the PR linked above is a part of to have the device reboot on network issues.
Happy to hear that you are working on the problem. Rebooting seems not like an option.
@mile-mwebert The samples in our repo aren't designed to account for network issues, so it would be a part of your application to re-organize the code to handle network re-initialization.
@vaavva Thanks for your answer. Are you working on an example that includes network reinitialization? From my point of view, handling reconnects is a crucial part when working with IoT devices and since this works the same way for most devices and applications, that could be part of the SDK or an exampl.e Isnt it?
@Quinnvanderschaar2 Have you found the leaking part by chance?
@mile-mwebert , I have looked into it, but it is very hard to work with the SDK when your are not known with the software. In my opinion the re-initialization must also be part of the code, as the MQTT gives a Time-out after some time. This makes the re-initialization a key feature for operating reliably without rebooting.
Unfortunately, these samples were not designed to be production-ready samples. They were designed to simply and succinctly show an example of how to use the sdk. We are not currently working on a sample that includes network re-initialization.
@mile-mwebert , I have looked into it, but it is very hard to work with the SDK when your are not known with the software. In my opinion the re-initialization must also be part of the code, as the MQTT gives a Time-out after some time. This makes the re-initialization a key feature for operating reliably without rebooting.
@Quinnvanderschaar2 I've been having an issue similar to what you described earlier ("Every time a new connection is opened the heap increases") and found a solution. I'm using the sample-azure-iot example on an ESP32 chip using x.509 certificates for authentication, and I narrowed down the problem to the TLS_Socket_Connect and TLS_Socket_Disconnect functions in demos/projects/ESPRESSIF/esp32/components/sample-azure-iot/transport_tls_esp32.c
. Hopefully this applies to your situation too, or anyone else facing this problem.
In TLS_Socket_Connect, there is a call made to esp_tls_set_global_ca_store
, which appears to be making a copy of the given trusted root certificate, but there is never any call made to clean up the cert when there is a connection failure or disconnection. So the next time TLS_Socket_Connect is called, heap memory will be allocated again to store the certificate. Eventually calling this function multiple times will suck up all the heap. To resolve this, I added calls to esp_tls_free_global_ca_store()
in both TLS_Socket_Connect and TLS_Socket_Disconnect to ensure certificates aren't being continuously added.
Edit: to help prove this was the behavior occurring, I verified heap memory usage over time by making periodic calls to esp_get_free_heap_size()
and esp_get_minimum_free_heap_size()
throughout the TLS connect/disconnect process.
@Vivero we were running into this memory issue after upgrading the Azure SDK and basing the azure connection in our application on this example. This call to esp_tls_free_global_ca_store()
is definitely needed, and we're no longer seeing the heap memory run out after the application runs for a long period of time. Thank you for sharing!
@Vivero Thank you for sharing!!!. This also solved my issues.
This issue is for a: (mark with an
x
)Minimal steps to reproduce
Any log messages given by the failure
Expected/desired behavior
OS and Version?
Mention any other details that might be useful