cesanta / mongoose-os

Mongoose OS - an IoT Firmware Development Framework. Supported microcontrollers: ESP32, ESP8266, CC3220, CC3200, STM32F4, STM32L4, STM32F7. Amazon AWS IoT, Microsoft Azure, Google IoT Core integrated. Code in C or JavaScript.
https://mongoose-os.com
Other
2.48k stars 430 forks source link

Running out of RAM during OTA update #587

Closed fabio-mantelli closed 2 years ago

fabio-mantelli commented 2 years ago

Hello, I am developing a device for my company using mongooseOS, and I am currently having issues during OTA updates through shadow. It appears to be that the OTA fails sometimes when the device tries to allocate memory, but it happens in different stages of the OTA process. Sometimes just after receiving the HTTP header and other times when it is writing the fs.bin file. My suspicion is that memory fragmentation might be an issue.

I am using MJS and lots of different libraries for this project, which makes space an issue. For example, I have removed the comments of most of the mongoose libraries and reduced the ca and certificate files to a minimum in order to save enough space in the fs. However, the issue I am seeing now I believe is in the RAM.

I am using mos tool 2.18.0, because I started the project with this version and for some obscure reason, If I update, my project won't build.

Mostly I have been able to work around the issues I have been facing by sorting through the source code of the libraries, available in the mongoose-os-libs. However, I do not have access to the mgos_ota_core.c source file and others, where I may have wanted to insert a print or something to help on debug.

For the issue with the malloc failing after we get the HTTP headers, I believe this could be fixed by enabling the firmware download through range Get requests.

For the issue with writing the fs.bin file to the new partition, I am not exactly sure what to do to fix it, but to try to find other ways to clear or defragment the RAM. This has become a big issue, since we have just bought a bunch of licenses to release our product and now the OTA update have been failing with our latest FW versions. How can I increase the available RAM memory during runtime? I am not doing any explicit dynamic memory allocation on my code, but perhaps some libaries are. Any suggestions at all?

Small part of the debug log where the failed malloc occurs. complete logs are attached. I have hidden some information that might be sensitive on the log, but it should not make a difference in helping me with this problem. I have also added some prints to help with debug.

HTTP fail

[Nov 18 08:19:52.929] mgos_ota_http_clien:130 Looking for HTTP header
[Nov 18 08:19:53.082] E:M 16421 (2054 blocks)
[Nov 18 08:19:53.082] mg_ssl_if_mbedtls.c:30  0x3fff1bbc alloc 16421 failed
[Nov 18 08:19:53.143] mg_ssl_if_mbedtls.c:30  0x3fff1bbc mbedtls_ssl_fetch_input() returned -32512 (-0x7f00)
[Nov 18 08:19:53.143] mg_ssl_if_mbedtls.c:30  0x3fff1bbc ssl_get_next_record() returned -32512 (-0x7f00)
[Nov 18 08:19:53.143] mg_ssl_if_mbedtls.c:30  0x3fff1bbc mbedtls_ssl_read_record() returned -32512 (-0x7f00)
[Nov 18 08:19:53.144] mg_ssl_if_mbedtls.c:206 0x3fff1bbc mbedTLS error: -0x7f00
[Nov 18 08:19:53.144] mgos_ota_http_clien:186 Connection closed
[Nov 18 08:19:56.127] mg_http.c:2991          https://<hidden>/shelly/01/fw_1_0_6.zip use_ssl? 1 tcp://<hidden>
[Nov 18 08:19:56.266] mg_net.c:907            0x3fff1bbc tcp://<hidden> -,-,ca.pem
[Nov 18 08:19:56.266] mgos_vfs.c:280          ca.pem -> /ca.pem pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:19:56.267] mgos_vfs.c:375          open ca.pem 0x0 0x1b6 => 0x3ffeff44 ca.pem 1 => 257 (refs 11)
[Nov 18 08:19:56.268] mgos_vfs.c:409          close 257 => 0x3ffeff44:1 => 0 (refs 10)
[Nov 18 08:19:56.268] mg_net.c:907            0x3fff6d5c udp://8.8.8.8:53 -,-,-
[Nov 18 08:19:56.269] mg_net.c:777            0x3fff6d5c udp://8.8.8.8:53
[Nov 18 08:19:56.269] mgos_ota_http_clien:252 OTA download https://<hidden>/shelly/01/fw_1_0_6.zip   
[Nov 18 08:19:56.270] [GET /shelly/01/fw_1_0_6.zip HTTP/1.1
[Nov 18 08:19:56.270] Host: <hidden>
[Nov 18 08:19:56.271] Content-Length: 0
[Nov 18 08:19:56.271] Device-Memory: 0.25
[Nov 18 08:19:56.272] Range: bytes=1460-
[Nov 18 08:19:56.272] Connection: close
[Nov 18 08:19:56.273] X-MGOS-Device-ID: <hidden> <hidden>
[Nov 18 08:19:56.273] User-Agent: ota-http-client/1.0.6 (MongooseOS/202111172114; esp8266)
[Nov 18 08:19:56.274] X-MGOS-FW-Version: esp8266 1.0.6 20211117-211429/geea79ab-main-dirty
[Nov 18 08:19:56.274]
[Nov 18 08:19:56.275] ]
[Nov 18 08:19:56.275] mg_net.c:791            0x3fff6d5c udp://8.8.8.8:53 -> 0
[Nov 18 08:19:56.276] mg_net.c:777            0x3fff1bbc tcp://<hidden>:443
[Nov 18 08:19:56.433] mg_net.c:791            0x3fff1bbc tcp://<hidden>:443 -> 0
[Nov 18 08:19:57.060] mg_ssl_if_mbedtls.c:30  0x3fff1bbc ciphersuite: TLS-ECDHE-RSA-WITH-AES-128-GCM-SHA256
[Nov 18 08:19:57.470] mgos_vfs.c:280          ca.pem -> /ca.pem pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:19:57.471] mgos_vfs.c:375          open ca.pem 0x0 0x1b6 => 0x3ffeff44 ca.pem 1 => 257 (refs 11)
[Nov 18 08:19:57.472] mgos_vfs.c:535          fstat 257 => 0x3ffeff44:1 => 0 (size 8106)
[Nov 18 08:19:57.515] mgos_vfs.c:409          close 257 => 0x3ffeff44:1 => 0 (refs 10)
[Nov 18 08:19:57.562] mgos_vfs.c:280          ca.pem -> /ca.pem pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:19:57.593] mgos_vfs.c:375          open ca.pem 0x0 0x1b6 => 0x3ffeff44 ca.pem 1 => 257 (refs 11)
[Nov 18 08:19:57.594] mgos_vfs.c:535          fstat 257 => 0x3ffeff44:1 => 0 (size 8106)
[Nov 18 08:19:57.700] mgos_vfs.c:409          close 257 => 0x3ffeff44:1 => 0 (refs 10)
[Nov 18 08:20:04.360] SW ECDH curve 3
[Nov 18 08:20:05.213] mgos_ota_http_clien:130 Looking for HTTP header
[Nov 18 08:20:05.579] E:M 16421 (2054 blocks)
[Nov 18 08:20:05.580] mg_ssl_if_mbedtls.c:30  0x3fff1bbc alloc 16421 failed
[Nov 18 08:20:05.581] mg_ssl_if_mbedtls.c:30  0x3fff1bbc mbedtls_ssl_fetch_input() returned -32512 (-0x7f00)
[Nov 18 08:20:05.581] mg_ssl_if_mbedtls.c:30  0x3fff1bbc ssl_get_next_record() returned -32512 (-0x7f00)
[Nov 18 08:20:05.582] mg_ssl_if_mbedtls.c:30  0x3fff1bbc mbedtls_ssl_read_record() returned -32512 (-0x7f00)
[Nov 18 08:20:05.583] mg_ssl_if_mbedtls.c:206 0x3fff1bbc mbedTLS error: -0x7f00
[Nov 18 08:20:05.583] mgos_ota_http_clien:186 Connection closed

fs.bin fail

[Nov 18 08:11:17.153] mgos_ota_core.c:265     Update state 4 -> 5
[Nov 18 08:11:17.153] mgos_ota_core.c:504     84.42% total, esp_init_data_default_v08.bin 128 of 128
[Nov 18 08:11:17.154] esp_ota_backend.c:569   cur 0 prev 0 fwu 0
[Nov 18 08:11:17.154] mgos_vfs.c:280          updater.dat -> /updater.dat pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:11:17.154] mgos_vfs.c:375          open updater.dat 0x0 0x1b6 => 0x3ffeff44 updater.dat -1 => -1 (refs 10)
[Nov 18 08:11:17.155] mgos_aws_shadow.c:475   Update: {"state": {"reported": {"ota":{"message": "Waiting for data", "status": 1, "is_committed": true, "partition": 0, "progress_percent": 84}}}, "clientToken": "d51d59a1"}
[Nov 18 08:11:17.155] mgos_vfs.c:280          / -> / pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:11:17.404] mgos_vfs_fs_spiffs.:803 GC result -10001 del pages 55 -> 52
[Nov 18 08:11:17.738] mgos_vfs_fs_spiffs.:803 GC result -10001 del pages 52 -> 55
[Nov 18 08:11:17.738] mgos_vfs.c:1046         / 0x3ffeff44 1
[Nov 18 08:11:17.739] mgos_aws_shadow.c:477   Total RAM: 53168 / Free RAM: 5456 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[Nov 18 08:11:17.739] mgos_mqtt_conn.c:199    MQTT0 pub -> 25 $aws/things/<hidden>/shadow/update @ 1 (166): [{"state": {"reported": {"ota":{"message": "Waiting for data", "status": 1, "is_committed": true, "partition": 0, "progress_percent": 84}}}, "clientToken": "d51d59a1"}]
[Nov 18 08:11:17.799] mgos_event.c:134        ev SDW1 triggered 2 handlers
[Nov 18 08:11:17.799] mgos_event.c:134        ev OTA1 triggered 2 handlers
[Nov 18 08:11:17.800] mgos_ota_core.c:729     esp_init_data_default_v08.bin: CRC OK
[Nov 18 08:11:17.801] mgos_ota_core.c:265     Update state 5 -> 6
[Nov 18 08:11:17.802] mgos_ota_core.c:265     Update state 6 -> 4
[Nov 18 08:11:17.802] mgos_ota_core.c:330     File name: FW_shelly1-1.0.6/fs.bin
[Nov 18 08:11:17.803] mgos_ota_core.c:332     Method 0 cs 131072 ucs 131072
[Nov 18 08:11:17.803] mgos_ota_core.c:353     File name to use: fs.bin
[Nov 18 08:11:17.803] mgos_ota_core.c:362     File size: 131072
[Nov 18 08:11:17.803] mgos_ota_core.c:366     General flag=0x0
[Nov 18 08:11:17.804] mgos_ota_core.c:370     File CRC32: 0x7d2dfc0c
[Nov 18 08:11:17.905] esp_ota_backend.c:224   SHA1 131072 @ 0x1db000 = b1053303cbbc5bf73469c2db83e67501db97458d, want 555f94d3eb6e9f139a1eec8854bf7b44f900bd6d
[Nov 18 08:11:17.906] esp_ota_backend.c:338   Start writing fs.bin (131072) @ 0x1db000
[Nov 18 08:11:17.907] mgos_ota_core.c:671     fs.bin claimed by backend 0
[Nov 18 08:11:17.908] mgos_ota_core.c:265     Update state 4 -> 5
[Nov 18 08:11:17.908] esp_flash_writer.c:76   Erase sector 475 (0x1db000) -> 0
[Nov 18 08:11:17.967] mgos_ota_core.c:504     84.49% total, fs.bin 512 of 131072
[Nov 18 08:11:17.967] esp_ota_backend.c:569   cur 0 prev 0 fwu 0
[Nov 18 08:11:17.968] mgos_vfs.c:280          updater.dat -> /updater.dat pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:11:17.969] mgos_vfs.c:375          open updater.dat 0x0 0x1b6 => 0x3ffeff44 updater.dat -1 => -1 (refs 10)
[Nov 18 08:11:17.969] mgos_aws_shadow.c:475   Update: {"state": {"reported": {"ota":{"message": "Waiting for data", "status": 1, "is_committed": true, "partition": 0, "progress_percent": 84}}}, "clientToken": "d51d59a1"}
[Nov 18 08:11:17.969] mgos_vfs.c:280          / -> / pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:11:18.286] mgos_vfs_fs_spiffs.:803 GC result -10001 del pages 55 -> 56
[Nov 18 08:11:18.287] mgos_vfs.c:1046         / 0x3ffeff44 1
[Nov 18 08:11:18.288] mgos_aws_shadow.c:477   Total RAM: 53168 / Free RAM: 3496 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[Nov 18 08:11:18.290] mgos_mqtt_conn.c:199    MQTT0 pub -> 26 $aws/things/<hidden>/shadow/update @ 1 (166): [{"state": {"reported": {"ota":{"message": "Waiting for data", "status": 1, "is_committed": true, "partition": 0, "progress_percent": 84}}}, "clientToken": "d51d59a1"}]
[Nov 18 08:11:18.317] mgos_event.c:134        ev SDW1 triggered 2 handlers
[Nov 18 08:11:18.319] mgos_event.c:134        ev OTA1 triggered 2 handlers
[Nov 18 08:11:18.319] mgos_mqtt_conn.c:222    MQTT0 event: 204
[Nov 18 08:11:18.319] mgos_mqtt_conn.c:164    MQTT0 ack 25
[Nov 18 08:11:18.409] E:M 1460 (183 blocks)
[Nov 18 08:11:18.410] mgos_mqtt_conn.c:263    MQTT0 Disconnect
[Nov 18 08:11:18.410] mgos_event.c:134        ev MOS5 triggered 1 handlers
[Nov 18 08:11:18.411] mgos_event.c:134        ev MOS5 triggered 1 handlers
[Nov 18 08:11:18.411] mgos_mqtt_conn.c:556    MQTT0 connecting after 1989 ms
[Nov 18 08:11:18.440] esp_flash_writer.c:76   Erase sector 476 (0x1dc000) -> 0
[Nov 18 08:11:20.482] esp_flash_writer.c:76   Erase sector 477 (0x1dd000) -> 0
[Nov 18 08:11:20.483] mgos_mqtt_conn.c:478    MQTT0 connecting to <hidden>:8883
[Nov 18 08:11:20.484] mgos_event.c:134        ev MOS6 triggered 1 handlers
[Nov 18 08:11:20.485] mg_net.c:907            0x3fff1874 <hidden>:8883 deviceAndCACert.crt,device.key,ca.pem
[Nov 18 08:11:20.487] mgos_vfs.c:280          deviceAndCACert.crt -> /deviceAndCACert.crt pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:11:20.487] mgos_vfs.c:375          open deviceAndCACert.crt 0x0 0x1b6 => 0x3ffeff44 deviceAndCACert.crt 1 => 257 (refs 11)     
[Nov 18 08:11:20.489] mgos_vfs.c:535          fstat 257 => 0x3ffeff44:1 => 0 (size 2472)
[Nov 18 08:11:20.489] mgos_vfs.c:535          fstat 257 => 0x3ffeff44:1 => 0 (size 2472)
[Nov 18 08:11:20.490] mgos_vfs.c:563          lseek 257 0 1 => 0x3ffeff44:1 => 0
[Nov 18 08:11:20.490] mgos_vfs.c:563          lseek 257 2048 0 => 0x3ffeff44:1 => 2048
[Nov 18 08:11:20.490] mgos_vfs.c:563          lseek 257 0 0 => 0x3ffeff44:1 => 0
[Nov 18 08:11:20.490] mgos_vfs.c:409          close 257 => 0x3ffeff44:1 => 0 (refs 10)
[Nov 18 08:11:20.491] mgos_vfs.c:280          device.key -> /device.key pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:11:20.818] mgos_vfs.c:375          open device.key 0x0 0x1b6 => 0x3ffeff44 device.key 1 => 257 (refs 11)
[Nov 18 08:11:20.819] mgos_vfs.c:535          fstat 257 => 0x3ffeff44:1 => 0 (size 1702)
[Nov 18 08:11:20.820] mgos_vfs.c:535          fstat 257 => 0x3ffeff44:1 => 0 (size 1702)
[Nov 18 08:11:20.821] mgos_vfs.c:563          lseek 257 0 1 => 0x3ffeff44:1 => 0
[Nov 18 08:11:20.822] mgos_vfs.c:563          lseek 257 1024 0 => 0x3ffeff44:1 => 1024
[Nov 18 08:11:20.822] mgos_vfs.c:563          lseek 257 0 0 => 0x3ffeff44:1 => 0
[Nov 18 08:11:20.823] mgos_vfs.c:409          close 257 => 0x3ffeff44:1 => 0 (refs 10)
[Nov 18 08:11:20.823] mgos_vfs.c:280          ca.pem -> /ca.pem pl 1 -> 1 0x3ffeff44 (refs 11)
[Nov 18 08:11:20.823] mgos_vfs.c:375          open ca.pem 0x0 0x1b6 => 0x3ffeff44 ca.pem 1 => 257 (refs 11)
[Nov 18 08:11:20.824] mgos_vfs.c:409          close 257 => 0x3ffeff44:1 => 0 (refs 10)
[Nov 18 08:11:20.824] E:M 1056 (133 blocks)
[Nov 18 08:11:20.824]
[Nov 18 08:11:20.824] Exception 29 @ 0x4000e1b2, vaddr 0x00000000
[Nov 18 08:11:20.825]  A0: 0x4028b2b3  A1: 0x3ffffbf0  A2: 0x00000000  A3: 0x00000000
[Nov 18 08:11:20.825]  A4: 0x00000420  A5: 0x00000000  A6: 0x00000420  A7: 0x00000042
[Nov 18 08:11:20.825]  A8: 0x3ffef8d0  A9: 0x00000190 A10: 0x00000009 A11: 0x00000020
[Nov 18 08:11:20.825] A12: 0x00000420 A13: 0x00000000 A14: 0x00000001 A15: 0x3ffffca0
[Nov 18 08:11:20.826]
[Nov 18 08:11:20.826] (exc SP: 0x3ffffa50)
[Nov 18 08:11:20.826]
[Nov 18 08:11:20.826] --- BEGIN CORE DUMP ---
[Nov 18 08:11:20.828] mos: catching core dump
[Nov 18 08:11:23.571] ..

log_HTTP_header_fail.txt log_fs_bin_fail.txt

rojer commented 2 years ago

E:M 16421 (2054 blocks)

this is a known issue with TLS - records can be up to 16K in size and it entirely depends on the server. if it decides to send us 16K of data, we hav eno choice but to deal with it. well, the way to deal with it is record fragmentation but as of today it is not supported by mbedtls. the only possible workaround is to manipulate server side to send data in smaller chunks. that is, to flush data more often.

fabio-mantelli commented 2 years ago

Thanks for the answer. I have noticed in the debug that in the HTTP request header, there is the "Range" field, but only with the range start, but not the end. Perhaps an alteration could be made and a config parameter could be added where we can set the maximum response chunk size and the library automatically sets the range start and end in the HTTP header. Since there is a "extra_http_headers" config in the ota-http-client library, I imagine such functionality is possible. I think this would solve the issue with servers that can reply adequately to such requests.

We are using AWS cloud platform, and we are having trouble finding a way to manipulate the S3 or CloudFront, or any other service to send smaller chunks.

The work around we found was to whenever we get an OTA update request from shadow, we disable some features on the mongoose os and reboot the device with just enough functionalities to perform OTA through a local RPC call. This way we free some RAM when we wish to perform OTA updates. After that we enable the features again in case the update fails and reboot the device.

I would like to recommend that this issue is mentioned somewhere in documentation.