espressif / esp-lwip

Fork of lwIP (https://savannah.nongnu.org/projects/lwip/) with ESP-IDF specific patches
Other
83 stars 128 forks source link

[all] etharp assert "no packet queues allowed!" (IDFGH-13386) #71

Open KonssnoK opened 1 month ago

KonssnoK commented 1 month ago

from a45be9e438f6cf9c54ec150581819c3b95d5af6b

==================== CURRENT THREAD STACK =====================
#0  0x403764c5 in panic_abort (details=0x3fcebc3b \"assert failed: etharp_query IDF/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (no packet queues allowed!)\") at /opt/esp-idf/components/esp_system/panic.c:408
#1  0x40384db4 in esp_system_abort (details=0x3fcebc3b \"assert failed: etharp_query IDF/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (no packet queues allowed!)\") at /opt/esp-idf/components/esp_system/esp_system.c:137
#2  0x4038dfe0 in __assert_func (file=0x3c1efc20 \"IDF/components/lwip/lwip/src/core/ipv4/etharp.c\", line=<optimized out>, func=<optimized out>, expr=0x3c1efddc \"no packet queues allowed!\") at /opt/esp-idf/components/newlib/assert.c:85
#3  0x420dab40 in etharp_query (netif=0x3d805c5c, ipaddr=<optimized out>, q=0x3d83e4ac) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:1025
#4  0x420dad4c in etharp_output (netif=0x3d805c5c, q=0x3d83e4ac, ipaddr=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:906
#5  0x420db8d8 in ip4_forward (p=0x3d83e4ac, iphdr=<optimized out>, inp=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:439
#6  0x420dbaac in ip4_input (p=0x3d83e4ac, inp=0x3d818ad4) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:715
#7  0x420dd559 in ppp_input (pcb=0x3d82621c, pb=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/ppp.c:899
#8  0x420ddde8 in pppos_input (ppp=0x3d82621c, s=0x3d83d0d9 <error: Cannot access memory at address 0x3d83d0d9>, l=0) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:583
#9  0x420ddf5a in pppos_input_sys (p=0x3d83d0b4, inp=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:454
#10 0x420d13a4 in tcpip_thread_handle_msg (msg=0x3d83ad98) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:180
#11 0x420d1427 in tcpip_thread (arg=0x0) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:154
#12 0x40389085 in vPortTaskWrapper (pxCode=0x420d13ec <tcpip_thread>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

anybody has idea of what could trigger this issue?

KonssnoK commented 1 month ago

maybe the assert is supposed to be

LWIP_ASSERT("no packet queues allowed!", (p->len == p->tot_len) && (p->next == 0)); ?

If the total len is not the len of the packet or there is another packet attached to this one.

KonssnoK commented 1 month ago

i added a log printf("etharp_query %d %d next 0x%08X\n", p->len, p->tot_len, (uint32_t)p->next); right before the assert

this is the crash i get:


I (12544) mesh: 2004<arm>parent monitor, my layer:2(cap:6)(node), interval:7856ms, retries:1<normal connected>
W (01:01:18.232) mesh_main: <MESH_EVENT_PARENT_CONNECTED>layer:0-->2, parent:7CDFA1E2A305<layer2>, ID:777777777776
I (01:01:18.241) mesh_netif: Clearing interface <sta>
I (01:01:18.248) mesh_netif: It was a wifi station, removing handlers
I (01:01:18.255) mesh_netif: Interface MESH
ethernet_output: sending packet 0x3d824de4
I (12593) wifi:<ba-add>idx:0 (ifx:0, 7c:df:a1:e2:a3:05), tid:5, ssn:0, winSize:64
W (01:01:18.277) mesh_main: <MESH_EVENT_TODS_REACHABLE>reachable:0
ethernet_input: dest:ff:ff:ff:ff:ff:ff, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d82501c
ethernet_input: dest:ff:ff:ff:ff:ff:ff, src:7c:df:a1:e2:a3:05, type:800
etharp_find_entry: found empty entry 0
etharp_find_entry: selecting empty entry 0
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d805bb8
etharp_timer
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d805bb8
etharp_find_entry: found matching entry 0
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d805bb8
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d805bb8
W (01:01:19.293) network: GOT IP from mesh_sta
I (01:01:19.295) network: Network connected
I (01:01:19.296) mqtt_app: MQTT_EVENT_BEFORE_CONNECT
I (01:01:19.299) mqtt_app: Try [0] connection to mqtts://mqtt-stg.tiko.energy:8883
etharp_find_entry: found empty entry 1
etharp_find_entry: selecting empty entry 1
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d8256a8
etharp_query 66 66 next 0x00000000

assert failed: etharp_query IDF/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (no packet queues allowed! len)

Backtrace: 0x403764da:0x3fceba90 0x40384dc9:0x3fcebab0 0x4038dfe5:0x3fcebad0 0x420deee9:0x3fcebbf0 0x420df121:0x3fcebc10 0x420e020d:0x3fcebc40 0x420e02b1:0x3fcebc80 0x420dcaf5:0x3fcebcb0 0x420dcb8d:0x3fcebcf0 0x420dcc2d:0x3fcebd10 0x420d5cad:0x3fcebd30 0x420d5d8b:0x3fcebd70 0x420d61c0:0x3fcebd90 0x420d66e5:0x3fcebdc0 0x420e7358:0x3fcebde0 0x420d5291:0x3fcebe00 0x420d533c:0x3fcebe20 0x4038909e:0x3fcebe50
0x403764da: panic_abort at C:/src/esp-idf/components/esp_system/panic.c:408

0x40384dc9: esp_system_abort at C:/src/esp-idf/components/esp_system/esp_system.c:137

0x4038dfe5: __assert_func at C:/src/esp-idf/components/newlib/assert.c:85

0x420deee9: etharp_query at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (discriminator 1)

0x420df121: etharp_output at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:906

0x420e020d: ip4_output_if_opt_src at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:1108 (discriminator 4)

0x420e02b1: ip4_output_if_src at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:926

0x420dcaf5: udp_sendto_if_src at C:/src/esp-idf/components/lwip/lwip/src/core/udp.c:906 (discriminator 4)

0x420dcb8d: udp_sendto_if at C:/src/esp-idf/components/lwip/lwip/src/core/udp.c:705

0x420dcc2d: udp_sendto at C:/src/esp-idf/components/lwip/lwip/src/core/udp.c:612

0x420d5cad: dns_send at C:/src/esp-idf/components/lwip/lwip/src/core/dns.c:921

0x420d5d8b: dns_check_entry at C:/src/esp-idf/components/lwip/lwip/src/core/dns.c:1139

0x420d61c0: dns_enqueue at C:/src/esp-idf/components/lwip/lwip/src/core/dns.c:1580 (discriminator 4)

0x420d66e5: dns_gethostbyname_addrtype at C:/src/esp-idf/components/lwip/lwip/src/core/dns.c:1731

0x420e7358: lwip_netconn_do_gethostbyname at C:/src/esp-idf/components/lwip/lwip/src/api/api_msg.c:2274

0x420d5291: tcpip_thread_handle_msg at C:/src/esp-idf/components/lwip/lwip/src/api/tcpip.c:168 (discriminator 4)

0x420d533c: tcpip_thread at C:/src/esp-idf/components/lwip/lwip/src/api/tcpip.c:154

0x4038909e: vPortTaskWrapper at C:/src/esp-idf/components/freertos/port/xtensa/port.c:142

this is what happens by changing the assert, but i-m not sure this is a correct fix.


W (01:00:08.513) network: GOT IP from mesh_sta
I (01:00:08.515) network: Network connected
I (01:00:08.520) mqtt_app: MQTT_EVENT_BEFORE_CONNECT
I (01:00:08.525) mqtt_app: Try [0] connection to mqtts://mqtt-stg.tiko.energy:8883
etharp_find_entry: found empty entry 1
etharp_find_entry: selecting empty entry 1
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d8256a0
etharp_query 66 66 next 0x00000000
etharp_query: queued packet 0x3d825624 on ARP entry 1
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:806
etharp_update_arp_entry: 10.0.0.1 - 7c:df:a1:e2:a3:05
etharp_find_entry: found matching entry 1
etharp_update_arp_entry: updating stable entry 1
ethernet_output: sending packet 0x3d825624
etharp_input: incoming ARP reply
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d8256c4
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d8256c4
ethernet_output: sending packet 0x3d82cd50
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d825a70
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d82dee4
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d825a28
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d82ffcc
ethernet_output: sending packet 0x3d8302c4
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d82f5bc
etharp_timer
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d830c04
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d82f5bc
I (01:00:09.672) stats: iteration_time_us=1408328 Name=last% IDLE1=88 IDLE0=31 tiT=6 wifi=3 mqtt_task=66 
I (01:00:09.907) regulator: iteration took, 233 ms
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d8312cc
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d82f5bc
etharp_timer
etharp_request: sending ARP request.
etharp_raw: sending raw ARP packet.
ethernet_output: sending packet 0x3d830c4c
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_output: sending packet 0x3d830450
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800
ethernet_input: dest:ac:17:54:00:0a:b0, src:7c:df:a1:e2:a3:05, type:800

https://github.com/lwip-tcpip/lwip/pull/44

KonssnoK commented 1 week ago

any news?

espressif-abhikroy commented 1 week ago

Hi @KonssnoK,

It seems that both logs show: etharp_query 66 66 next 0x00000000, so it looks like it should work even without the change you suggested.

I'm a bit puzzled as to why it would crash. Could you check if there might be any memory corruption issues in your project?

Also, could you share a simple code snippet that we can use to reproduce this problem?

KonssnoK commented 1 week ago

@espressif-abhikroy of course both logs show the same etharp_query 66 66 next 0x00000000 because the failing condition is the same, what changes is the assert. In my tests when investigating the crash i had to split the assert in 2 parts, one checking the length of the packet and one checking the next pointer. This is why in this case we see a crash for 66 66.

Sadly i cannot share code snippets cause the crash is happening only when using LTE PPP on our custom board.

This is how I understood it:

The assert is (p->len != p->tot_len) || (p->next == 0) which means we want to verify the packet to have either next to 0 (not a packet queue) or len different from tot_len. This second check doesn't make sense in my opinion, if the len is not equal to tot_len it means the packet is fragmented and therefore a packet queue.

Did I understood anything wrong?

I have to recheck the crashes because few weeks passed and had to work on other stuff.

espressif-abhikroy commented 1 week ago

@KonssnoK I agree with your understanding. However, I believe this check is primarily for sanitization purposes.

To explain the negative scenario:

The assertion will fail when both of the following conditions are true:

In this case, the check fails because the code expects that if the packet isn't fragmented (i.e., p->len == p->tot_len), there should be no additional packets queued (i.e., p->next == 0). Having a fully contained pbuf with a non-null p->next suggests an improper packet queue, which could lead to unpredictable behavior.

KonssnoK commented 5 days ago

@espressif-abhikroy what i can say is that we have devices in the field having the issue, so it's not simply sanitization for us. I'm thinking this happens for very large installations (more than 10 devices), where packets have to go all through the LTE device).

I agree with your negative scenario, I still have no idea why this was triggering so often on some installations.

KonssnoK commented 5 days ago

as reference i checked the amount of crashes we are cureently having on th field for this reason: attached image image

stack trace extends to:


======================== THREADS INFO =========================
#0  0x403764c9 in panic_abort (details=0x3fcebbeb \"assert failed: etharp_query IDF/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (no packet queues allowed!)\") at /opt/esp-idf/components/esp_system/panic.c:408
#1  0x403846b8 in esp_system_abort (details=0x3fcebbeb \"assert failed: etharp_query IDF/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (no packet queues allowed!)\") at /opt/esp-idf/components/esp_system/esp_system.c:137
#2  0x4038d6cc in __assert_func (file=0x3c1e4a40 \"IDF/components/lwip/lwip/src/core/ipv4/etharp.c\", line=<optimized out>, func=<optimized out>, expr=0x3c1e4bfc \"no packet queues allowed!\") at /opt/esp-idf/components/newlib/assert.c:85
#3  0x420d8604 in etharp_query (netif=0x3d825a04, ipaddr=<optimized out>, q=0x3d83d9f8) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:1025
#4  0x420d8810 in etharp_output (netif=0x3d825a04, q=0x3d83d9f8, ipaddr=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:906
#5  0x420d939c in ip4_forward (p=0x3d83d9f8, iphdr=<optimized out>, inp=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:439
#6  0x420d9570 in ip4_input (p=0x3d83d9f8, inp=0x3d825d00) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:715
#7  0x420db01d in ppp_input (pcb=0x3d825e8c, pb=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/ppp.c:899
#8  0x420db8ac in pppos_input (ppp=0x3d825e8c, s=0x3d83ce69 <error: Cannot access memory at address 0x3d83ce69>, l=0) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:583
#9  0x420dba1e in pppos_input_sys (p=0x3d83ce40, inp=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:454
#10 0x420cee68 in tcpip_thread_handle_msg (msg=0x3d83b228) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:180
#11 0x420ceeeb in tcpip_thread (arg=0x0) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:154
#12 0x40388989 in vPortTaskWrapper (pxCode=0x420ceeb0 <tcpip_thread>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142```
KonssnoK commented 21 hours ago

well, i confirm that changing the asset does not solve the issue


==================== CURRENT THREAD STACK =====================
#0  0x403764c5 in panic_abort (details=0x3fcebbeb \"assert failed: etharp_query IDF/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (no packet queues allowed!)\") at /opt/esp-idf/components/esp_system/panic.c:408
#1  0x40384db4 in esp_system_abort (details=0x3fcebbeb \"assert failed: etharp_query IDF/components/lwip/lwip/src/core/ipv4/etharp.c:1025 (no packet queues allowed!)\") at /opt/esp-idf/components/esp_system/esp_system.c:137
#2  0x4038dfe0 in __assert_func (file=0x3c1fa748 \"IDF/components/lwip/lwip/src/core/ipv4/etharp.c\", line=<optimized out>, func=<optimized out>, expr=0x3c1fa904 \"no packet queues allowed!\") at /opt/esp-idf/components/newlib/assert.c:85
#3  0x420df502 in etharp_query (netif=0x3d82bd0c, ipaddr=<optimized out>, q=0x3d83ec80) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:1025
#4  0x420df710 in etharp_output (netif=0x3d82bd0c, q=0x3d83ec80, ipaddr=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:906
#5  0x420e029c in ip4_forward (p=0x3d83ec80, iphdr=<optimized out>, inp=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:439
#6  0x420e0470 in ip4_input (p=0x3d83ec80, inp=0x3d825674) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:715
#7  0x420e1f1d in ppp_input (pcb=0x3d825800, pb=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/ppp.c:899
#8  0x420e27ac in pppos_input (ppp=0x3d825800, s=0x3d83e28f <error: Cannot access memory at address 0x3d83e28f>, l=0) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:583
#9  0x420e291e in pppos_input_sys (p=0x3d83e250, inp=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:454
#10 0x420d5d64 in tcpip_thread_handle_msg (msg=0x3d83ca7c) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:180
#11 0x420d5de7 in tcpip_thread (arg=0x0) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:154
#12 0x40389085 in vPortTaskWrapper (pxCode=0x420d5dac <tcpip_thread>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

I'll go back to checking how we can avoid this

espressif-abhikroy commented 17 hours ago

@KonssnoK, It appears that there may not be an issue within etharp_query(). However, it is possible that the previous layers are not handling the pbuf correctly in certain scenarios, which might be causing the assertion in etharp_query().

Since you're unable to provide code to reproduce the issue, could you try printing the values of p->len, p->tot_len, and p->next in each function along the call stack, starting from pppos_input_sys()? Please share the logs from an affected device. This could help in identifying the root cause of the problem.