RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org
GNU Lesser General Public License v2.1
4.97k stars 1.99k forks source link

nanocoap: problem with [re]transmission behaviour: (some blocks are sent twice immediately) #18217

Closed fabian18 closed 2 years ago

fabian18 commented 2 years ago

Description

During the testing of #17962, I discovered that something in the transmission behaviour of nanocoap is not working correctly. Using e.g. tests/nanocoap_cli: client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata, to create a resource on a server. The first transmission succeeds, but if you repeat the command, a retransmission is also triggered immediately. Even though the first transmission succeeds, the first call to sock_udp_sendv() in nanocoap_sock_request_cb() returns the expected value, but the message is not sent out (?). At least it does not appear in wireshark. Then the first call to sock_udp_recv_buf() always returns -ETIMEDOUT. But then in the second call to sock_udp_sendv(), the message is sent out and appears in wireshark.

Steps to reproduce the issue

Expected results

The nanocoap client should not immediately trigger a retransmission, but wait for [ACK_TIMEOUT, ACK_TIMEOUT * RANDOM_FACTOR], before it triggers the first retransmission.

Actual results

Most of the time, the retransmission happens immediately, if the server is not very fast with the ACK. nanocoap_cli_put.pcapng.gz

Versions

Linux 5.18.3-arch1-1 (but that´s not relevant)

benpicco commented 2 years ago

Hm this seems to be one of those problems that go away when setting ENABLE_DEBUG to 1 :weary:

fabian18 commented 2 years ago

With ENABLE_DEBUG=1 in nanocoap/sock.c, I get a failed assertion:

> client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata
2022-06-22 11:22:45,749 # client put fe80::3478:9157:9fd5:81c3%5 5683 /somepath somedata
2022-06-22 11:22:45,752 # nanocli: sending msg ID 1, 25 bytes
2022-06-22 11:22:45,755 # nanocoap: send 25 bytes (5 tries left)
2022-06-22 11:22:45,760 # nanocoap: waiting for response (timeout: 2092058 µs)
2022-06-22 11:22:45,762 # nanocoap: response code=204
2022-06-22 11:22:45,767 # nanocoap: waiting for response (timeout: 2085058 µs)
2022-06-22 11:22:45,773 # sys/net/gnrc/pktbuf/gnrc_pktbuf.c:93 => *** RIOT kernel panic:
2022-06-22 11:22:45,774 # FAILED ASSERTION.
2022-06-22 11:22:45,775 # 
2022-06-22 11:22:45,776 # *** halted.
2022-06-22 11:22:45,776 # 
2022-06-22 11:23:22,860 # Exiting Pyterm
assert(gnrc_pktbuf_contains(pkt));
benpicco commented 2 years ago

Wait what :fearful:

Is that with the current master? When I try

% coap-server-notls -A fe80::90a7:a6ff:fe4b:2e32%tapbr0 -d 1

and

> client put fe80::90a7:a6ff:fe4b:2e32 5683 /somepath somedata

I get

client put fe80::90a7:a6ff:fe4b:2e32 5683 /somepath somedata
nanocli: sending msg ID 1, 25 bytes
nanocoap: send 25 bytes (5 tries left)
nanocoap: waiting for response (timeout: 2448721 µs)
nanocoap: response code=201
nanocoap: waiting for response (timeout: 2447721 µs)
nanocli: response Success, code 2.01, empty payload
fabian18 commented 2 years ago

Is that with the current master?

master 1638606c5c01b24d967e5630fa43efad5437add9

You have to send the request multiple times. Your last comment shows only 2.01 Created.

fabian18 commented 2 years ago

and with current master branch, yes

benpicco commented 2 years ago

I don't see this on native, is this on stm32?

fabian18 commented 2 years ago

yes on stm32

benpicco commented 2 years ago

Could also be an out of bounds write on the Ethernet driver, corrupting the pktbuf free list.

can you try with https://github.com/RIOT-OS/RIOT/pull/18227

fabian18 commented 2 years ago

can you try with https://github.com/RIOT-OS/RIOT/pull/18227

CFLAGS+=-DCONFIG_GNRC_PKTBUF_CHECK_USE_AFTER_FREE=1 makes no difference, the same assertion is triggered.

benpicco commented 2 years ago

What 'works' surprisingly well is

--- a/sys/net/application_layer/nanocoap/sock.c
+++ b/sys/net/application_layer/nanocoap/sock.c
@@ -174,7 +174,7 @@ ssize_t nanocoap_sock_request_cb(nanocoap_sock_t *sock, coap_pkt_t *pkt,
         case STATE_RESPONSE_RCVD:
         case STATE_RESPONSE_OK:
             if (ctx == NULL) {
-                DEBUG("nanocoap: waiting for response (timeout: %"PRIu32" µs)\n",
+                printf("nanocoap: waiting for response (timeout: %"PRIu32" µs)\n",
                       _deadline_left_us(deadline));
             }
             const void *old_ctx = ctx;

However I don't see an immediate re-transmission in the error case, instead the node appears to wait for the timeout and never gets the response. I do see the response in Wireshark however when sniffing directly on the node's interface. (DOSE, with #18163)

benpicco commented 2 years ago

So my issue was caused by a different thread being scheduled every 10 ms (and calling ztimer_set each time), which in turn caused the UART RX interrupt to be delayed enough that it would only be executed after another byte was received already. The lost byte caused a CRC mismatch and the frame to be dropped, even though it was complete on the wire.

Adding the printf (or a busy loop, really) before calling sock_udp_recv_buf() meant that the other thread did not get scheduled and the RX interrupts would occur in time. But the proper fix is of course to pause all other threads when downloading firmware updates (since improving RIOTs real-time grantees is certainly no easy task).

I assume the behavior you saw was then caused by issue described in #18416 - so can we close this now?

fabian18 commented 2 years ago

Good news: With #18416 no assertion fires anymore, and it is functioning very well.

But still, on the second attempt to upload a file, some blocks are transmitted twice immediately. With ENABLE_DEBUG=1 it is Block 1 and without it is Block 0. So there still is something odd, but CoAP makes it work nevertheless.

fabian18 commented 2 years ago

so can we close this now?

I would keep it open, because it is only partially fixed.

fabian18 commented 2 years ago

fixed with #18969