Open brunohpg opened 5 months ago
I don't think this is happening due to some glitches on UART wires.
It might be theoretically possible that the modem would send a lot of data at once, but according to what I saw, most of the devices refrain from sending more data after 3-4kB (without response), which shouldn't cause too much trouble.
We can of course simulate this condition, and I agree that this would cause trouble; the RAM pbufs are normally allocated on heap and there's nothing preventing from allocating them indefinitely. But I'd like to understand where this issue comes from, before trying to fix it.
lwip provides a very nice stats
module, where you can easily check pbuf allocation statistics. Could you please check that you're really running out of memory because of PPPoS pbuf allocations?
Sure. I don't know how to do it, but I'll look for how to enable stats
and get back to you.
I was reviewing the path to allocate pbuf and I found an interesting point. Check out this small piece of code:
By default, the variable pbuf_alloc_len = 0
. Only when using IP_FORWARD
(and I do!) is there a defined value. From what I can see, the value of the constant should be 1508 bytes, because that's what the system tried to allocate. It's a high value.
At first I imagined that there could be several pppos inputs in the same call, but there isn't. There's only one input for each.
However, there can be multiple PDDATA
within the same pppos input. In fact, they can be joins of multiple inputs as well. The first PDDATA
of each packet will allocate 1508 bytes. So I don't need a very large input packet to get trouble. Nor do I need a very large input. If there are 40 1-byte packets in the same input, probably 60k won't be enough.
Does that make sense?
Some UDP flood will break the system.
I'm running the system connected to the Wi-Fi for one week. There shouldn't have been a problem and there wasn't! I'm not using PPPoS at this point.
I am sending the log for two reasons: the other tasks are not generating memory problems and lwip is not recording memory consumption. I believe that MEM_STATS
and MEMP_STATS
should be active.
I will enable the defines in CMakelist and force the use of pppos. I will come back soon with details.
I (614875685) vRTC: RTC Data: 15/06/2024 19:03:53 (1718478233)
I (614875692) vRTC: Sys Startup: 08/06/2024 16:15:58 (1717863358)
Heap summary for capabilities 0x00001800:
At 0x3ffae6e0 len 6432 free 132 allocated 5680 min_free 4
largest_free_block 88 alloc_blocks 60 free_blocks 3 total_blocks 63
At 0x3ffbf6f0 len 133392 free 932 allocated 129972 min_free 4
largest_free_block 272 alloc_blocks 387 free_blocks 10 total_blocks 397
At 0x3ffe0440 len 15072 free 400 allocated 14192 min_free 4
largest_free_block 192 alloc_blocks 16 free_blocks 4 total_blocks 20
At 0x3ffe4350 len 113840 free 49880 allocated 62808 min_free 37816
largest_free_block 47104 alloc_blocks 70 free_blocks 5 total_blocks 75
Totals:
free 51344 allocated 212652 min_free 37828 largest_free_block 47104
LINK
xmit: 39
recv: 8
fw: 0
drop: 1
chkerr: 0
lenerr: 0
memerr: 0
rterr: 1
proterr: 0
opterr: 0
err: 0
cachehit: 0
ETHARP
xmit: 3891
recv: 64709
fw: 0
drop: 18119
chkerr: 0
lenerr: 0
memerr: 0
rterr: 0
proterr: 18119
opterr: 0
err: 0
cachehit: 48342
IP_FRAG
xmit: 0
recv: 0
fw: 0
drop: 0
chkerr: 0
lenerr: 0
memerr: 0
rterr: 0
proterr: 0
opterr: 0
err: 0
cachehit: 0
IP
xmit: 8709
recv: 35578
fw: 0
drop: 30
chkerr: 0
lenerr: 30
memerr: 0
rterr: 10
proterr: 0
opterr: 0
err: 0
cachehit: 0
IGMP
xmit: 0
recv: 7781
drop: 2249
chkerr: 0
lenerr: 0
memerr: 0
proterr: 0
rx_v1: 0
rx_group: 0
rx_general: 5265
rx_report: 267
tx_join: 0
tx_leave: 0
tx_report: 0
ICMP
xmit: 93
recv: 2967
fw: 0
drop: 10
chkerr: 0
lenerr: 0
memerr: 0
rterr: 0
proterr: 10
opterr: 0
err: 0
cachehit: 0
UDP
xmit: 28576
recv: 20965
fw: 0
drop: 87
chkerr: 0
lenerr: 0
memerr: 0
rterr: 10
proterr: 87
opterr: 0
err: 0
cachehit: 6832
TCP
xmit: 51638
recv: 1335
fw: 0
drop: 1290
chkerr: 0
lenerr: 0
memerr: 0
rterr: 0
proterr: 1290
opterr: 0
err: 0
cachehit: 45
SYS
sem.used: 0
sem.max: 0
sem.err: 0
mutex.used: 0
mutex.max: 0
mutex.err: 0
mbox.used: 0
mbox.max: 0
mbox.err: 0
IP NAPT
nr_active_tcp: 0
nr_active_udp: 0
nr_active_icmp: 0
nr_forced_evictions: 0
I think I am reproducing the same out of memory issue.
I have a similar setup (same IDF, build method and ESP32S3 SOC) and I have activated as well IP forwarding. I'm getting a lot of "pppos_input_tcpip failed with -1" errors. I think these errors are pbuf allocation failures.
My setup is the following: I have a SIMCOM A7602 modem that is driven through UART to get a PPPoS connection. I'm trying to have a remote access to the local network that is connected through Ethernet to the ESP32 (via a W5500 module). That's why I added a wireguard VPN (to have a fix IP) and activated IP forwarding and the NAPT functionalities between the wireguard netif and the ethernet netif.
This setup works well but the connection is awfully slow, due to a large number of memory allocation failures, and it sometime leads to the same heap_caps_alloc_failed crash.
@david-cermak I posted this here because I think our issue are related but I can open a new issue if you think this is not the case.
I tried activating lwip stats but I did not manage to get more data than the one sent by brunohpg. I willl try to activate MEM_STATS and MEMP_STATS to get more logs.
@damiencampanella @david-cermak
I have actived lwip mem stats. But the problem is not generated over time. I did not see a significant increase in memory over time. It is something more acute.
I also did a udp flood and that did not generate the problem. I really do not know what the problem is. Maybe it is something really in the corruption of PPP packets. But this should have a checksum for prevention (assumption: I do not know how PPP packets are formed).
There is a problem with lwip mem stats. We cannot enable MEM_STATS
. If we enable it, the heap is corrupted. I have tried several ways, but it always corrupts the heap. But we can enable MEMP_STATS
without any problems.
To do this, put in CMakeLists:
#add_compile_definitions("MEM_STATS" "MEMP_STATS")
add_compile_definitions("MEMP_STATS")
The solution would actually be to limit the amount of memory lwip uses. Looking at the code, I see that there is a possibility. But I haven't tried it and I don't know how to do it.
Answers checklist.
IDF version.
ESP-IDF v5.2.1
Espressif SoC revision.
ESP32 v3.0
Operating System used.
Windows
How did you build your project?
Command line with idf.py
If you are using Windows, please specify command line type.
CMD
Development Kit.
Custom
Power Supply used.
External 5V
What is the expected behavior?
LwIP PPPoS allocates memory through a pool of pBuf. I believe there should be an allocation limit once it is validated whether the pbuf has been allocated or not.
https://github.com/espressif/esp-lwip/blob/f174681317a00d209e374c66d9691ef79b4195de/src/netif/ppp/pppos.c#L681-L683
Therefore, I would expect that PPPoS would not consume all system memory.
What is the actual behavior?
It seems to me that there is no allocation limit through PPPoS. I believe that when there is a glitch in UART reception and bits are exchanged, PPPoS can understand that it must create several packets and generate a lack of memory. It's an assumption.
I think it might not be related to esp-idf, but rather lwip. If anyone has any deeper knowledge of Lwip, they could help me unravel the problem.
I made a printout of available CAP8 memory. There is "a lot" of memory available before the lwip task allocates it to pppos, but even so, it runs out of memory.
Steps to reproduce.
There's no special step. Just use PPPoS. I think we should expect for some uart glitch.
Debug Logs.
More Information.
I do not have the knowledge to know whether changing bits within PPPoS could cause the system to generate a very large number of packets.
There is another assumption: a large amount of information arrived at the same time via PPPoS.
In any case, it is not interesting to depend on PPPoS input. It would be more interesting to lose packets than to restart the system.
PS: My system has other tasks that allocate memory. I always give preference to static memory and run it for days for memory analysis. I believe the problem is not related to other tasks.