Azure-Samples / iot-middleware-freertos-samples

This repo has samples for dev kits using the Azure IoT middleware for FreeRTOS
MIT License
76 stars 46 forks source link

PnP crashes after 64 minutes #338

Closed Quinnvanderschaar2 closed 1 year ago

Quinnvanderschaar2 commented 1 year ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

esp32 pnp example

Any log messages given by the failure

Expected/desired behavior

After one hour of sending telemetry messages each minute, the device crashes and gives the following error: I (3947613) AZ IOT: -----------------send telemetry----------------- I (3947633) AZ IOT: Successfully sent telemetry message I (3947633) AZ IOT: Attempt to receive publish message from IoT Hub. E (3947643) TRANSPORT_BASE: poll_read select error 104, errno = Connection reset by peer, fd = 54 E (3947643) esp_tls: Reading failed, errno= 119 E (3947643) MQTT: A single byte was not read from the transport: transportStatus=-1. E (3947663) MQTT: Receiving incoming packet length failed. Status=MQTTRecvFailed E (3947663) MQTT: Exiting process loop due to failure: ErrorStatus=MQTTRecvFailed E (3947673) AZ IOT: AzureIoTMQTT_ProcessLoop failed: ProcessLoopDuration=30, MQTT error=0x00000004

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?) Linux Mint

Versions

I ESP-IDF v4.4.3-159-g54dcccc1f6-dirty

Mention any other details that might be useful


Thanks! We'll be in touch soon.

danewalton commented 1 year ago

Hi @Quinnvanderschaar2 I assume you're using SAS keys for auth?

Quinnvanderschaar2 commented 1 year ago

Yes

danewalton commented 1 year ago

I have this repro'd. Let me look into this.

danewalton commented 1 year ago

Hey @Quinnvanderschaar2 If you have a moment, can you try this branch here and see if it helps? This will let the device attempt a reconnect on process loop failure. In MQTT 3, a token expiration requires a disconnect and reconnect, which this does. Also to note for you and others, the extra ~5 minutes above the default 60 minutes is to account for potential clock drift.

https://github.com/Azure-Samples/iot-middleware-freertos-samples/tree/dane/reconnect

Quinnvanderschaar2 commented 1 year ago

Hi @danewalton , I have tested this before. It works for like 7 hours, after that there is now heap space available anymore. Every time a new connection is opened the heap increases. I think that there are memory leaks in the code. Best regards, Quinn.

danewalton commented 1 year ago

I'm looking into running our Linux sample in valgrind (a little complicated since it needs libpcap and sudo permissions), but I'm not sure that's happening. In the mean time, I think more likely might be memory fragmentation. You can try running the sample with mbedtls static mode turned on if you'd like: https://github.com/Mbed-TLS/mbedtls-docs/blob/main/kb/how-to/using-static-memory-instead-of-the-heap.md

danewalton commented 1 year ago

Hi @Quinnvanderschaar2 I linked the above PR which will close this issue, as it resolves the original problem. The memory issue is a different, and as mentioned, I would suggest the static memory capability of mbedTLS to help with possible issues with malloc().

Quinnvanderschaar2 commented 1 year ago

Hi @danewalton , I have not yet managed to reproduce a solution with the static memory capability of mbedTLS. I have tried to change the config file but this resulted without any result. Do you have a config file that I could use? Best regards, Quinn.

Quinnvanderschaar2 commented 1 year ago

@danewalton

Quinnvanderschaar2 commented 1 year ago

Hi @danewalton, @ewertons ,
This problem is not solved in the latest version that is available on GitHub. Currently we are heading to a release, so we would appreciate that there will be a solution available for the memory fragmentation/leaks. Best regards, Quinn van der Schaar.

vaavva commented 1 year ago

@Quinnvanderschaar2 My recommendation for now would be to reboot the device instead of breaking out of the sample loop (as in the branch Dane sent previously) as everything doesn't get completely re-initialized within the demo loop. I'm also having issues running valgrind to debug potential memory issues, and unfortunately don't have a config file for mbedTLS to try out the static memory capabilities.

We are still working on the work item that the PR linked above is a part of to have the device reboot on network issues.

libreo-mwebert commented 1 year ago

Happy to hear that you are working on the problem. Rebooting seems not like an option.

vaavva commented 1 year ago

@mile-mwebert The samples in our repo aren't designed to account for network issues, so it would be a part of your application to re-organize the code to handle network re-initialization.

libreo-mwebert commented 1 year ago

@vaavva Thanks for your answer. Are you working on an example that includes network reinitialization? From my point of view, handling reconnects is a crucial part when working with IoT devices and since this works the same way for most devices and applications, that could be part of the SDK or an exampl.e Isnt it?

@Quinnvanderschaar2 Have you found the leaking part by chance?

Quinnvanderschaar2 commented 1 year ago

@mile-mwebert , I have looked into it, but it is very hard to work with the SDK when your are not known with the software. In my opinion the re-initialization must also be part of the code, as the MQTT gives a Time-out after some time. This makes the re-initialization a key feature for operating reliably without rebooting.

vaavva commented 1 year ago

Unfortunately, these samples were not designed to be production-ready samples. They were designed to simply and succinctly show an example of how to use the sdk. We are not currently working on a sample that includes network re-initialization.

Vivero commented 1 year ago

@mile-mwebert , I have looked into it, but it is very hard to work with the SDK when your are not known with the software. In my opinion the re-initialization must also be part of the code, as the MQTT gives a Time-out after some time. This makes the re-initialization a key feature for operating reliably without rebooting.

@Quinnvanderschaar2 I've been having an issue similar to what you described earlier ("Every time a new connection is opened the heap increases") and found a solution. I'm using the sample-azure-iot example on an ESP32 chip using x.509 certificates for authentication, and I narrowed down the problem to the TLS_Socket_Connect and TLS_Socket_Disconnect functions in demos/projects/ESPRESSIF/esp32/components/sample-azure-iot/transport_tls_esp32.c. Hopefully this applies to your situation too, or anyone else facing this problem.

In TLS_Socket_Connect, there is a call made to esp_tls_set_global_ca_store, which appears to be making a copy of the given trusted root certificate, but there is never any call made to clean up the cert when there is a connection failure or disconnection. So the next time TLS_Socket_Connect is called, heap memory will be allocated again to store the certificate. Eventually calling this function multiple times will suck up all the heap. To resolve this, I added calls to esp_tls_free_global_ca_store() in both TLS_Socket_Connect and TLS_Socket_Disconnect to ensure certificates aren't being continuously added.

Edit: to help prove this was the behavior occurring, I verified heap memory usage over time by making periodic calls to esp_get_free_heap_size() and esp_get_minimum_free_heap_size() throughout the TLS connect/disconnect process.

mikemurray3 commented 1 year ago

@Vivero we were running into this memory issue after upgrading the Azure SDK and basing the azure connection in our application on this example. This call to esp_tls_free_global_ca_store() is definitely needed, and we're no longer seeing the heap memory run out after the application runs for a long period of time. Thank you for sharing!

Quinnvanderschaar2 commented 1 year ago

@Vivero Thank you for sharing!!!. This also solved my issues.