espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.84k stars 7.32k forks source link

lwip pppos suddenly runs out of memory (IDFGH-12929) #13886

Open brunohpg opened 5 months ago

brunohpg commented 5 months ago

Answers checklist.

IDF version.

ESP-IDF v5.2.1

Espressif SoC revision.

ESP32 v3.0

Operating System used.

Windows

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

CMD

Development Kit.

Custom

Power Supply used.

External 5V

What is the expected behavior?

LwIP PPPoS allocates memory through a pool of pBuf. I believe there should be an allocation limit once it is validated whether the pbuf has been allocated or not.

https://github.com/espressif/esp-lwip/blob/f174681317a00d209e374c66d9691ef79b4195de/src/netif/ppp/pppos.c#L681-L683

Therefore, I would expect that PPPoS would not consume all system memory.

What is the actual behavior?

It seems to me that there is no allocation limit through PPPoS. I believe that when there is a glitch in UART reception and bits are exchanged, PPPoS can understand that it must create several packets and generate a lack of memory. It's an assumption.

I think it might not be related to esp-idf, but rather lwip. If anyone has any deeper knowledge of Lwip, they could help me unravel the problem.

I made a printout of available CAP8 memory. There is "a lot" of memory available before the lwip task allocates it to pppos, but even so, it runs out of memory.

Steps to reproduce.

There's no special step. Just use PPPoS. I think we should expect for some uart glitch.

Debug Logs.

I (3741804) Monitor: -- TASKS
Heap summary for capabilities 0x00000004:
  At 0x3ffae6e0 len 6432 free 4 allocated 5792 min_free 4
    largest_free_block 0 alloc_blocks 64 free_blocks 0 total_blocks 64
  At 0x3ffbedf8 len 135688 free 1740 allocated 131512 min_free 4
    largest_free_block 1088 alloc_blocks 374 free_blocks 8 total_blocks 382
  At 0x3ffe0440 len 15072 free 1700 allocated 12828 min_free 4
    largest_free_block 640 alloc_blocks 32 free_blocks 6 total_blocks 38
  At 0x3ffe4350 len 113840 free 66300 allocated 46536 min_free 54620
    largest_free_block 59392 alloc_blocks 33 free_blocks 5 total_blocks 38
  Totals:
    free 69744 allocated 196668 min_free 54632 largest_free_block 59392

I (3742248) VSock: VSecure Pre-alocado 3ffb7c40

Mem alloc fail. size 0x000005e4 caps 0x00001800

Backtrace: 0x400820a6:0x3ffc7990 0x4008c8dd:0x3ffc79b0 0x400826e7:0x3ffc79d0 0x40082803:0x3ffc7a30 0x4008281d:0x3ffc7a50 0x40098c55:0x3ffc7a70 0x400fac75:0x3ffc7a90 0x400fad30:0x3ffc7ab0 0x400fad9b:0x3ffc7ad0 0x40096252:0x3ffc7af0 0x40102b00:0x3ffc7b10 0x40102b6e:0x3ffc7b30 0x40097095:0x3ffc7b50 0x400918f5:0x3ffc7b80
0x400820a6: panic_abort at D:/esp32-release5-2/components/esp_system/panic.c:472
0x4008c8dd: esp_system_abort at D:/esp32-release5-2/components/esp_system/port/esp_system_chip.c:93
0x400826e7: heap_caps_alloc_failed at D:/esp32-release5-2/components/heap/heap_caps.c:96
0x40082803: heap_caps_malloc at D:/esp32-release5-2/components/heap/heap_caps.c:206
0x4008281d: heap_caps_malloc_default at D:/esp32-release5-2/components/heap/heap_caps.c:228
0x40098c55: malloc at D:/esp32-release5-2/components/newlib/heap.c:24
0x400fac75: mem_malloc at D:/esp32-release5-2/components/lwip/lwip/src/core/mem.c:209
0x400fad30: do_memp_malloc_pool at D:/esp32-release5-2/components/lwip/lwip/src/core/memp.c:254
0x400fad9b: memp_malloc at D:/esp32-release5-2/components/lwip/lwip/src/core/memp.c:350
0x40096252: pbuf_alloc at D:/esp32-release5-2/components/lwip/lwip/src/core/pbuf.c:243
0x40102b00: pppos_input at D:/esp32-release5-2/components/lwip/lwip/src/netif/ppp/pppos.c:681
0x40102b6e: pppos_input_sys at D:/esp32-release5-2/components/lwip/lwip/src/netif/ppp/pppos.c:448
0x40097095: tcpip_thread_handle_msg at D:/esp32-release5-2/components/lwip/lwip/src/api/tcpip.c:174
 (inlined by) tcpip_thread at D:/esp32-release5-2/components/lwip/lwip/src/api/tcpip.c:148
0x400918f5: vPortTaskWrapper at D:/esp32-release5-2/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

More Information.

I do not have the knowledge to know whether changing bits within PPPoS could cause the system to generate a very large number of packets.

There is another assumption: a large amount of information arrived at the same time via PPPoS.

In any case, it is not interesting to depend on PPPoS input. It would be more interesting to lose packets than to restart the system.

PS: My system has other tasks that allocate memory. I always give preference to static memory and run it for days for memory analysis. I believe the problem is not related to other tasks.

david-cermak commented 5 months ago

I don't think this is happening due to some glitches on UART wires. It might be theoretically possible that the modem would send a lot of data at once, but according to what I saw, most of the devices refrain from sending more data after 3-4kB (without response), which shouldn't cause too much trouble. We can of course simulate this condition, and I agree that this would cause trouble; the RAM pbufs are normally allocated on heap and there's nothing preventing from allocating them indefinitely. But I'd like to understand where this issue comes from, before trying to fix it. lwip provides a very nice stats module, where you can easily check pbuf allocation statistics. Could you please check that you're really running out of memory because of PPPoS pbuf allocations?

brunohpg commented 5 months ago

Sure. I don't know how to do it, but I'll look for how to enable stats and get back to you.

I was reviewing the path to allocate pbuf and I found an interesting point. Check out this small piece of code:

https://github.com/espressif/esp-lwip/blob/f174681317a00d209e374c66d9691ef79b4195de/src/netif/ppp/pppos.c#L658-L682

By default, the variable pbuf_alloc_len = 0. Only when using IP_FORWARD (and I do!) is there a defined value. From what I can see, the value of the constant should be 1508 bytes, because that's what the system tried to allocate. It's a high value.

At first I imagined that there could be several pppos inputs in the same call, but there isn't. There's only one input for each.

However, there can be multiple PDDATA within the same pppos input. In fact, they can be joins of multiple inputs as well. The first PDDATA of each packet will allocate 1508 bytes. So I don't need a very large input packet to get trouble. Nor do I need a very large input. If there are 40 1-byte packets in the same input, probably 60k won't be enough.

Does that make sense?

Some UDP flood will break the system.

brunohpg commented 5 months ago

I'm running the system connected to the Wi-Fi for one week. There shouldn't have been a problem and there wasn't! I'm not using PPPoS at this point.

I am sending the log for two reasons: the other tasks are not generating memory problems and lwip is not recording memory consumption. I believe that MEM_STATS and MEMP_STATS should be active.

I will enable the defines in CMakelist and force the use of pppos. I will come back soon with details.

I (614875685) vRTC: RTC Data: 15/06/2024 19:03:53 (1718478233)
I (614875692) vRTC: Sys Startup: 08/06/2024 16:15:58 (1717863358)

Heap summary for capabilities 0x00001800:
  At 0x3ffae6e0 len 6432 free 132 allocated 5680 min_free 4
    largest_free_block 88 alloc_blocks 60 free_blocks 3 total_blocks 63
  At 0x3ffbf6f0 len 133392 free 932 allocated 129972 min_free 4
    largest_free_block 272 alloc_blocks 387 free_blocks 10 total_blocks 397
  At 0x3ffe0440 len 15072 free 400 allocated 14192 min_free 4
    largest_free_block 192 alloc_blocks 16 free_blocks 4 total_blocks 20
  At 0x3ffe4350 len 113840 free 49880 allocated 62808 min_free 37816
    largest_free_block 47104 alloc_blocks 70 free_blocks 5 total_blocks 75
  Totals:
    free 51344 allocated 212652 min_free 37828 largest_free_block 47104

LINK
        xmit: 39
        recv: 8
        fw: 0
        drop: 1
        chkerr: 0
        lenerr: 0
        memerr: 0
        rterr: 1
        proterr: 0
        opterr: 0
        err: 0
        cachehit: 0

ETHARP
        xmit: 3891
        recv: 64709
        fw: 0
        drop: 18119
        chkerr: 0
        lenerr: 0
        memerr: 0
        rterr: 0
        proterr: 18119
        opterr: 0
        err: 0
        cachehit: 48342

IP_FRAG
        xmit: 0
        recv: 0
        fw: 0
        drop: 0
        chkerr: 0
        lenerr: 0
        memerr: 0
        rterr: 0
        proterr: 0
        opterr: 0
        err: 0
        cachehit: 0

IP
        xmit: 8709
        recv: 35578
        fw: 0
        drop: 30
        chkerr: 0
        lenerr: 30
        memerr: 0
        rterr: 10
        proterr: 0
        opterr: 0
        err: 0
        cachehit: 0

IGMP
        xmit: 0
        recv: 7781
        drop: 2249
        chkerr: 0
        lenerr: 0
        memerr: 0
        proterr: 0
        rx_v1: 0
        rx_group: 0
        rx_general: 5265
        rx_report: 267
        tx_join: 0
        tx_leave: 0
        tx_report: 0

ICMP
        xmit: 93
        recv: 2967
        fw: 0
        drop: 10
        chkerr: 0
        lenerr: 0
        memerr: 0
        rterr: 0
        proterr: 10
        opterr: 0
        err: 0
        cachehit: 0

UDP
        xmit: 28576
        recv: 20965
        fw: 0
        drop: 87
        chkerr: 0
        lenerr: 0
        memerr: 0
        rterr: 10
        proterr: 87
        opterr: 0
        err: 0
        cachehit: 6832

TCP
        xmit: 51638
        recv: 1335
        fw: 0
        drop: 1290
        chkerr: 0
        lenerr: 0
        memerr: 0
        rterr: 0
        proterr: 1290
        opterr: 0
        err: 0
        cachehit: 45

SYS
        sem.used:  0
        sem.max:   0
        sem.err:   0
        mutex.used: 0
        mutex.max:  0
        mutex.err:  0
        mbox.used:  0
        mbox.max:   0
        mbox.err:   0

IP NAPT
        nr_active_tcp:        0
        nr_active_udp:        0
        nr_active_icmp:       0
        nr_forced_evictions:  0
damiencampanella commented 4 months ago

I think I am reproducing the same out of memory issue.

I have a similar setup (same IDF, build method and ESP32S3 SOC) and I have activated as well IP forwarding. I'm getting a lot of "pppos_input_tcpip failed with -1" errors. I think these errors are pbuf allocation failures.

My setup is the following: I have a SIMCOM A7602 modem that is driven through UART to get a PPPoS connection. I'm trying to have a remote access to the local network that is connected through Ethernet to the ESP32 (via a W5500 module). That's why I added a wireguard VPN (to have a fix IP) and activated IP forwarding and the NAPT functionalities between the wireguard netif and the ethernet netif.

This setup works well but the connection is awfully slow, due to a large number of memory allocation failures, and it sometime leads to the same heap_caps_alloc_failed crash.

@david-cermak I posted this here because I think our issue are related but I can open a new issue if you think this is not the case.

I tried activating lwip stats but I did not manage to get more data than the one sent by brunohpg. I willl try to activate MEM_STATS and MEMP_STATS to get more logs.

brunohpg commented 4 months ago

@damiencampanella @david-cermak

I have actived lwip mem stats. But the problem is not generated over time. I did not see a significant increase in memory over time. It is something more acute.

I also did a udp flood and that did not generate the problem. I really do not know what the problem is. Maybe it is something really in the corruption of PPP packets. But this should have a checksum for prevention (assumption: I do not know how PPP packets are formed).

There is a problem with lwip mem stats. We cannot enable MEM_STATS. If we enable it, the heap is corrupted. I have tried several ways, but it always corrupts the heap. But we can enable MEMP_STATS without any problems.

To do this, put in CMakeLists:

#add_compile_definitions("MEM_STATS" "MEMP_STATS")
add_compile_definitions("MEMP_STATS")

The solution would actually be to limit the amount of memory lwip uses. Looking at the code, I see that there is a possibility. But I haven't tried it and I don't know how to do it.