Open benpicco opened 2 years ago
I have seen these issues in the past with the gnrc_netif/submac integration. With the last commits that should have been fixed.
Does this also occur if you don't use slipdev? If not, it could point to some leftovers in that direction...
Which "last commits" are you referring to?
It also happens with DOSE, but I think a second interface was required to trigger this (will try to reproduce this with a single one)
Which "last commits" are you referring to?
make -C examples/gnrc_border_router BOARD=nrf52840dk UPLINK=slip -j flash term
I think this could be an effect of having two interfaces running with the same priority. Considering that slip is orders of magnitude slower than ieee802154, there are high chances that the IEEE 802.15.4 timing requirements are not met. This is probably triggering a corner case we haven't seen under normal operation...
I think the problem is either in netdev_ieee802154_submac (not to be confused with ieee802154_submac) or in the gnrc_netif pktqueue integration.
Last but not least, I've seen "lost" events as a result of https://github.com/RIOT-OS/RIOT/pull/16899. This could also be the source.
make -C examples/gnrc_border_router BOARD=nrf52840dk UPLINK=slip -j flash term
Mh, I don't have that board and trying to reproduce with samr21-xpro yields the expected results. Trying on a similar particle-xenon I face the problem that it uses cdc_acm_stdio. That reminded me: slipdev only works with UART and IIRC the nrf52840dk used segger-rtt for the stdio (or was that changed?). If you give me a more detailed setup, maybe I manage to reproduce with the Xenon as well.
I think this could be an effect of having two interfaces running with the same priority.
@benpicco could you try whether running the interfaces' threads at different priorities solves the problem?
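For reference, a minimal sketch of what "different priorities" means at the thread level in RIOT (illustration only, not the actual gnrc_netif creation path; the handler bodies and names are placeholders). Lower numerical values mean higher priority, and GNRC's interface threads are normally all created with the same default priority (GNRC_NETIF_PRIO):

#include "thread.h"

static char _stack_a[THREAD_STACKSIZE_DEFAULT];
static char _stack_b[THREAD_STACKSIZE_DEFAULT];

static void *_radio_handler(void *arg) { (void)arg; /* ... */ return NULL; }
static void *_slip_handler(void *arg)  { (void)arg; /* ... */ return NULL; }

int main(void)
{
    /* lower number = higher priority; giving the (fast) radio a higher
     * priority than the (slow) slip interface would be one way to test
     * the hypothesis above */
    thread_create(_stack_a, sizeof(_stack_a), THREAD_PRIORITY_MAIN - 5,
                  THREAD_CREATE_STACKTEST, _radio_handler, NULL, "radio");
    thread_create(_stack_b, sizeof(_stack_b), THREAD_PRIORITY_MAIN - 4,
                  THREAD_CREATE_STACKTEST, _slip_handler, NULL, "slip");
    return 0;
}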
Mh, I don't have that board and trying to reproduce with samr21-xpro yields the expected results.
at86rf2xx does not use gnrc_pktqueue.
Trying on a similar particle-xenon I face the problem that it uses cdc_acm_stdio.
Can you use it together with usbus_cdc_ecm?
IIRC the nrf52840dk used segger-rtt for the stdio
Why would it? The debugger exposes a /dev/ttyACM0 that is connected to the UART pins of the board.
Then IARW (I apparently remembered wrong) ;-)
Can you use it together with usbus_cdc_ecm?
I don't get an immediate error: packet buffer full, but when using the gnrc_pktbuf_cmd module I can see that there is a leak in the packet buffer. It's not immediate, it takes several minutes, and I haven't found a quick way yet to trigger the condition.
Can you confirm that after the first execution of ping there is garbage in the packet buffer? This is what I see:
pktbuf
2022-04-21 16:07:36,384 # pktbuf
2022-04-21 16:07:36,387 # packet buffer: first byte: 0x200038b4, last byte: 0x200050b4 (size: 6144)
2022-04-21 16:07:36,388 # position of last byte used: 1016
2022-04-21 16:07:36,388 # =========== chunk 0 (0x200038b4 size: 120) ===========
2022-04-21 16:07:36,389 # 00000000 14 39 00 20 B4 39 00 20 51 00 00 00 01 00 00 00
2022-04-21 16:07:36,390 # 00000010 00 00 00 00 00 00 00 00 00 00 00 00 E4 38 00 20
2022-04-21 16:07:36,391 # 00000020 14 00 00 00 01 00 00 00 FF 52 00 59 11 40 FE 80
2022-04-21 16:07:36,391 # 00000030 06 06 06 00 00 00 00 80 EE 5D DA A4 A4 E9 EE 5D
2022-04-21 16:07:36,392 # 00000040 DA A4 A4 EB 00 00 EC 5D CC 38 00 20 84 39 00 20
2022-04-21 16:07:36,393 # 00000050 28 00 00 00 01 00 00 00 02 61 00 01 00 0A 00 03
2022-04-21 16:07:36,394 # 00000060 FC 38 00 20 AC 39 00 20 08 00 00 00 01 00 00 00
2022-04-21 16:07:36,394 # 00000070 04 A4 3E C4 8C 16 45 6E
2022-04-21 16:07:36,395 # ~ unused: 0x2000392c (next: 0x20003a0c, size: 88) ~
2022-04-21 16:07:36,395 # =========== chunk 1 (0x20003984 size: 136) ===========
2022-04-21 16:07:36,396 # 00000000 60 0E 91 52 00 59 11 40 FE 80 00 00 00 00 00 00
2022-04-21 16:07:36,397 # 00000010 72 CD 6F 2F 0C 4C 35 40 FE 80 00 00 00 00 00 00
2022-04-21 16:07:36,398 # 00000020 EC 5D DA FF FE A4 A4 EB 02 23 02 22 00 59 D6 C0
2022-04-21 16:07:36,399 # 00000030 07 DC EE 61 00 01 00 0A 00 03 00 01 EE 5D DA A4
2022-04-21 16:07:36,399 # 00000040 A4 EB 00 02 00 0E 00 01 00 01 27 A4 3E C4 8C 16
2022-04-21 16:07:36,400 # 00000050 45 6E 7A 1E 00 19 00 29 00 19 00 07 00 00 27 10
2022-04-21 16:07:36,401 # 00000060 00 00 4E 20 00 1A 00 19 00 00 75 30 00 00 9C 40
2022-04-21 16:07:36,401 # 00000070 40 20 01 0D B8 00 00 00 02 00 00 00 00 00 00 00
2022-04-21 16:07:36,402 # 00000080 00 00 00 00 B0 16 00 00
2022-04-21 16:07:36,402 # ~ unused: 0x20003a0c (next: (nil), size: 5800) ~
From just a quick scan, I think that's at least one response from the host: the start of chunk 1 looks like an IPv6 header with two link local unicast addresses. Since I pinged multicast, it can't be the echo request, and the source and destination addresses confirm that this comes from the host. If this is indeed the case for you as well, I don't believe it's gnrc_pktqueue (since the NIB also uses gnrc_pktqueue, that could have been a hint as well; we do not have any known issues there in that regard, but maybe in the gnrc_netif_pktq user, however...), as the packet queue should handle outgoing packets...
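To make that reading easier to check, here is a small stand-alone sketch (host-side plain C, bytes hand-copied from the dump above; a sanity check, not RIOT code) that parses the first IPv6 header fields of chunk 1:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* first 8 bytes of chunk 1 from the pktbuf dump above */
    const uint8_t h[] = { 0x60, 0x0E, 0x91, 0x52, 0x00, 0x59, 0x11, 0x40 };

    printf("version:     %u\n", h[0] >> 4);           /* 6 -> IPv6 */
    printf("payload len: %u\n", (h[4] << 8) | h[5]);  /* 89 bytes */
    printf("next header: %u\n", h[6]);                /* 17 -> UDP */
    printf("hop limit:   %u\n", h[7]);                /* 64 */
    /* the next 2 x 16 bytes in the dump are the source and destination
     * addresses, both starting with fe 80 .., i.e. two link local unicast
     * addresses, as noted above */
    return 0;
}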
I'm afraid this might not be related to gnrc_netif_pktq but to pktbuf in general with concurrent allocation and freeing:
I have a board with a SLIP and an atwinc15x0 interface. Data is received on the SLIP interface and stored on SD card; when a WiFi network is available, the data is read from the SD card and sent over the WiFi interface.
There is no gnrc_netif_pktq involved here, but after a few minutes of both interfaces sending and receiving data at the same time, no more pktsnips can be allocated and data ingress stops because of that.
My suspicion is a fragmentation issue, as small snips can still be allocated (the data that is received / sent uses large (~1k) frames).
Is it possible to provide a dump of gnrc_pktbuf_cmd when the error condition occurs?
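As an aside, a hedged sketch of how one might trigger such a dump from code instead of the shell (assuming the usual build guards: gnrc_pktbuf_stats() is only compiled in with DEVELHELP, and gnrc_pktbuf_is_empty() needs TEST_SUITES):

#include <stdio.h>
#include "net/gnrc/pktbuf.h"

/* call this periodically, or whenever an allocation fails */
static void check_pktbuf(void)
{
#ifdef DEVELHELP
    gnrc_pktbuf_stats();    /* same dump as the `pktbuf` shell command */
#endif
#ifdef TEST_SUITES
    if (!gnrc_pktbuf_is_empty()) {
        puts("pktbuf: not empty -> possible leak");
    }
#endif
}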
I now have a quite reliable way to trigger the condition: it requires sending on one interface while also receiving data on another interface. pktbuf remains full after I stop the sending board.
I've added #18102 for dual-entry bookkeeping, the result is pktbuf stats.txt, pktbuf leases.txt
I'm not quite sure what to make of that though.
I've added some more tracing to also log the place where the pktbuf allocation was made (pktbuf.log):
10b63: sys/net/gnrc/netif/ethernet/gnrc_netif_ethernet.c:182
10ba3: sys/net/gnrc/netif/ethernet/gnrc_netif_ethernet.c:217 (discriminator 8)
10bcf: sys/net/gnrc/netif/ethernet/gnrc_netif_ethernet.c:240
13473: sys/net/gnrc/transport_layer/udp/gnrc_udp.c:124
d78d: sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:781
(I switched from atwinc15x0 to sam0_eth for the uplink to rule out an issue with the WiFi driver)
Ok, this turns out to be an unrelated issue. The sending thread did not call sock_udp_recv() even though it was getting replies to the local port, filling up the mbox.
This can be solved by calling
while (sock_udp_recv(&sock, buf, sizeof(buf), 0, NULL) > 0) {}
after every sock_udp_send().
Since gnrc_sock is the only user of the mbox, this is not related to the leak in gnrc_netif_pktq.
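For completeness, a minimal sketch of that send-then-drain pattern (the names sock, remote, buf and the helper itself are placeholders, not taken from the application in question):

#include "net/sock/udp.h"

static void send_and_drain(sock_udp_t *sock, const sock_udp_ep_t *remote,
                           const void *data, size_t len)
{
    uint8_t buf[128];   /* scratch buffer, received replies are discarded */

    sock_udp_send(sock, data, len, remote);

    /* timeout 0 = don't block: empty the sock's queue so queued replies
     * don't keep pktsnips (and mbox slots) allocated */
    while (sock_udp_recv(sock, buf, sizeof(buf), 0, NULL) > 0) {}
}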
So the test application is no longer relevant?
Yes, the (unrelated to this issue) issue that was triggered is now resolved by #18143 and #18147, but gnrc_netif_pktq is still leaky.
In case of slip it has to do with this packet hold:
/* hold in case device was busy to not having to rewrite *all* the link
 * layer implementations in case `gnrc_netif_pktq` is included */
gnrc_pktbuf_hold(pkt, 1);
It holds the full packet (gnrc netif header + icmp + ipv6 header + ...).
Then we pass the packet to the netif send operation
int res = netif->ops->send(netif, pkt);
which in case of slipdev is raw_ops::send(), where we remove the gnrc netif header:
if (pkt->type == GNRC_NETTYPE_NETIF) {
    /* we don't need the netif snip: remove it */
    pkt = gnrc_pktbuf_remove_snip(pkt, pkt);
}
In gnrc_pktbuf_remove_snip we free the gnrc_netif header and also set its next pointer to NULL:
gnrc_pktsnip_t *gnrc_pktbuf_remove_snip(gnrc_pktsnip_t *pkt,
                                        gnrc_pktsnip_t *snip)
{
    pkt = gnrc_pkt_delete(pkt, snip);
    snip->next = NULL;
    gnrc_pktbuf_release(snip);
    return pkt;
}
After sending, the netif driver releases the remaining packet snips (those after the netif header):
res = dev->driver->send(dev, (iolist_t *)pkt);
if (gnrc_netif_netdev_legacy_api(netif)) {
    /* only for legacy drivers we need to release pkt here */
    gnrc_pktbuf_release(pkt);
}
Back in gnrc_netif.c::_send() we go into _tx_done() and try to delete the previously held packet:
if (gnrc_netif_netdev_new_api(netif)) {
    /* with new netdev (with confirm_send), the netif remains owner of the
     * pkt and is in charge of releasing it once TX is completed */
    gnrc_pktbuf_release_error(pkt, err);
}
But here pkt is the gnrc_netif header whose next pointer we set to NULL.
So the held packet snips after the gnrc netif header are not deleted; their reference counter has only been decreased once, by the netif driver.
We could maybe prohibit removing a packet snip that is not the first snip in the chain.
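To summarize the bookkeeping described above, a small stand-alone simulation of the "users" counters (plain C, not RIOT code; it simply replays the steps as described in this comment):

#include <stdio.h>

int main(void)
{
    /* users counter of the netif-header snip vs. the rest of the chain */
    unsigned netif_hdr = 1, rest = 1;   /* after allocation */

    netif_hdr++; rest++;    /* gnrc_pktbuf_hold(pkt, 1): holds the full chain */
    netif_hdr--;            /* raw_ops::send(): gnrc_pktbuf_remove_snip() */
    rest--;                 /* legacy driver path: gnrc_pktbuf_release(pkt) */
    netif_hdr--;            /* _tx_done(): releases pkt, but pkt->next is NULL,
                             * so only the netif header snip is reached */

    printf("netif_hdr users: %u\n", netif_hdr);   /* 0 -> freed */
    printf("rest users:      %u\n", rest);        /* 1 -> leaked */
    return 0;
}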
Hm, but gnrc_netif_netdev_legacy_api() and gnrc_netif_netdev_new_api() can't both be true for the same interface.
AFAIK slipdev does not implement the confirm send API yet, so the condition in the last paragraph should not occur.
Sorry, then it was in the else branch above.
This still happens with the border router app.
The only traffic is some pings between a 6lo node (fd35:ea7:b70b:2200:ac8d:fee1:60ed:99c4) and the Linux host (fdea:dbee:f::1).
The packet buffer keeps filling up and after a short while the border router is no longer able to send any packets.
> ping ff02::1
error: packet buffer full
error: packet buffer full
error: packet buffer full
--- ff02::1 PING statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
Description
Steps to reproduce the issue
Use gnrc_netif_pktq on a board with two interfaces (e.g. nrf802154 & slipdev), e.g. by running
make -C examples/gnrc_border_router BOARD=nrf52840dk UPLINK=slip -j flash term
Expected results
The node can send and receive packets.
Actual results
After a few minutes the node can no longer send any packets.
The node never recovers until it is rebooted.
Versions
RIOT master