Open fjmolinas opened 2 years ago
Brainstorming for solutions: I am somewhat wondering how https://github.com/RIOT-OS/RIOT/pull/17265 could have caused this... the shell and its prompt is controlled by the shell
module, so there is at least some layers of abstraction between the bug and the cause: stdio
but also the stdio
handler of ethos
. So why is this happening?
Flash an application using
stdio_ethos
, e.g.:examples/suit_update
,examples/gnrc_border_router
.
This does not seem to be the case in general. The changes in #17265 worked fine on the IoT-LAB M3 (and at least @chrysn and @benpicco do not seem to encounter this issue either). Which boards were you using for that.
Flash an application using
stdio_ethos
, e.g.:examples/suit_update
,examples/gnrc_border_router
.This does not seem to be the case in general. The changes in #17265 worked fine on the IoT-LAB M3 (and at least @chrysn and @benpicco do not seem to encounter this issue either). Which boards were you using for that.
nrf52840-mdk
, dwm1001
the ones I have with me right now...
I also saw it on iotlab-m3
a couple of times but quite rate, I don't think I saw it on samr21-xpro
...
As I am unable to reliably reproduce this on boards I have at hand at the moment (or rather not at all), I think some remote debugging is required... From the output you provided, it looks to me that the shell is running into this condition:
This is caused by either EOT or EOF sent by stdin
So if this bug is indeed stdio_ethos
related, those characters (\x04
and \xff
) should show up after this function call:
Could you maybe add a check like this:
if ((tmp > 0) && ((*ptr == 0xff) || (*ptr == 0x04))) {
/* do something (e.g. toggle LED) */
}
Then at least we would see if the issue is from here.
Can reproduce with feather-nrf52840dk
! Will try to debug.
I have been running into this problem lately relative often. If this helps, I've seen it happening under this conditions:
In my use case I have an nrf52850dk gateway running ethos and a DSME interface. I forward all DSME traffic through ethos but at the same time I print some MAC metrics when scheduling/transmitting a frame. In order to avoid the issue, I need to set some timers between commands.
I might be wrong, but I think I've seen these kind of issues even before https://github.com/RIOT-OS/RIOT/pull/17265, but since most of the times ethos run on a samr21-xpro (or equivalent), we probably didn't catch it before.
In my use case I have an nrf52850dk gateway running ethos and a DSME interface. I forward all DSME traffic through ethos but at the same time I print some MAC metrics when scheduling/transmitting a frame. In order to avoid the issue, I need to set some timers between commands.
For me, just hitting the on-board reset button also just stops it from happening.
Digging a bit through the code, it seems _read_r
in sys/newlib_syscalls_default/syscalls.c
only seems to be called once. After that, getchar
just always returns EOF
.
Was able to debug getchar
down to
Thread 2 hit Breakpoint 2, __srget_r (ptr=ptr@entry=0x20000274 <impure_data>, fp=fp@entry=0x20005b74) at ../../../../../../../../newlib-4.2.0.20211231/newlib/libc/stdio/rget.c:40
40 ../../../../../../../../newlib-4.2.0.20211231/newlib/libc/stdio/rget.c: No such file or directory.
Recursive internal problem.
make: *** [/home/mlenders/Repositories/RIOT-OS/RIOT/examples/gnrc_border_router/../../Makefile.include:884: debug] Terminated
Terminated
:confused:
Now I suddenly can't reproduce anymore :astonished:
Now I suddenly can't reproduce anymore astonished
back in business... appearently make flash term
seems to cause it, while "make flash; sleep 1; make term" seems to work (sometimes).
back in business... appearently make flash term seems to cause it, while "make flash; sleep 1; make term" seems to work (sometimes).
Same here! I usually trigger a make reset
before make term
in the experiment scripts to avoid that.
But this seems to be more of a newlib
+stdio
problem than a ethos_stdio
issue. I suspect (but could not show yet) that the nope ... that's not itFILE
objects for the re-entrants are somewhat broken, whenever newlib
calls CHECK_INIT()
Using picolibc also seems to work as a workaround:
Grr nope with that I get a flaky stdin
...
> h
help
el
lp
p
Command Description
---------------------------------------
reboot Reboot the node
version Prints current RIOT_VERSION
pm interact with layered PM subsystem
ps Prints information about running threads.
ping6 Ping via ICMPv6
ping Alias for ping6
nib Configure neighbor information base
ifconfig Configure network interfaces
6ctx 6LoWPAN context configuration tool
(literally had to type h<enter>e<enter>l<enter>p<enter>
to get that output...
if you connect to /dev/ttyACM0 directly after it gets in the loop (instead of using ethos), does it also print endless ">"?
if you connect to /dev/ttyACM0 directly after it gets in the loop (instead of using ethos), does it also print endless ">"?
If I do
$ PROGRAMMER=jlink BOARD=feather-nrf52840 PORT=/dev/ttyUSB0 TERMPROG=picocom TERMFLAGS="--nolock --imap lfcrlf --baud 115200 /dev/ttyUSB0" make -C examples/gnrc_border_router -j flash term
no.
Was able to debug
getchar
down toThread 2 hit Breakpoint 2, __srget_r (ptr=ptr@entry=0x20000274 <impure_data>, fp=fp@entry=0x20005b74) at ../../../../../../../../newlib-4.2.0.20211231/newlib/libc/stdio/rget.c:40 40 ../../../../../../../../newlib-4.2.0.20211231/newlib/libc/stdio/rget.c: No such file or directory. Recursive internal problem. make: *** [/home/mlenders/Repositories/RIOT-OS/RIOT/examples/gnrc_border_router/../../Makefile.include:884: debug] Terminated Terminated
:confused:
Btw: it's result already is -1 (i.e. EOF) in _getc_r()
Interesting: the fp
pointer in _getc_r()
has (right after flash) already help\n
stored in its _p
member:
>>> print *fp
$1 = {
_p = 0x20005e78 "help\n",
_r = 0,
_w = 0,
_flags = 164,
_file = 0,
_bf = {
_base = 0x20005e78 "help\n",
_size = 1024
},
_lbfsize = 0,
_data = 0x0 <__retarget_lock_init>,
_cookie = 0x200058c4,
_read = 0x1016d <__sread>,
_write = 0x1018f <__swrite>,
_seek = 0x101c7 <__sseek>,
_close = 0x101eb <__sclose>,
_ub = {
_base = 0x0 <__retarget_lock_init>,
_size = 0
},
_up = 0x0 <__retarget_lock_init>,
_ur = 0,
_ubuf = "\000\000",
_nbuf = "",
_lb = {
_base = 0x0 <__retarget_lock_init>,
_size = 0
},
_blksize = 0,
_offset = 0,
_lock = 0x0 <__retarget_lock_init>,
_mbstate = {
__count = 0,
__value = {
__wch = 0,
__wchb = "\000\000\000"
}
},
_flags2 = 0
}
Seems to be whatever was last entered into the terminal (even after a restart of make flash term
).
Ok, following the trace doesn't seem to make any sense (most likely due to optimization?). Going down the rabbit hole, I land here at some point according to gdb in __srefill_r
.
/*
* We were reading. If there is an ungetc buffer,
* we must have been reading from that. Drop it,
* restoring the previous buffer (if any). If there
* is anything in that buffer, return.
*/
if (HASUB (fp))
{
FREEUB (ptr, fp);
if ((fp->_r = fp->_ur) != 0)
{
fp->_p = fp->_up;
return 0;
}
}
resulting in __sgetc_r()
to return -1 here.
int
getc (register FILE *fp)
{
int result;
struct _reent *reent = _REENT;
CHECK_INIT (reent, fp);
_newlib_flockfile_start (fp);
result = __sgetc_r (reent, fp);
_newlib_flockfile_end (fp);
return result;
}
However, this doesn't make any sense, as both as a pre- and post-condition to __srefill_r
neither is fs->_ub.base != NULL
(the condition under which HASUB(fp)
is true, nor is (fp->r = fp->_ur) != 0
.
All in all, I am pretty much at a loss what might be the problem. Given that this error happens so deep into newlib
's internals, I tend to agree with @jia200x's comment: https://github.com/RIOT-OS/RIOT/pull/17265 seems not to be the cause, it just made it more obvious due to the performance increase.
Using picolibc also seems to work as a workaround:
Sadly not a workaround I can use with BLE border routers, since for some reason NimBLE does not work with Picolibc (at least not the border router)
Description
Since https://github.com/RIOT-OS/RIOT/pull/17265 usage of
stdio_ethos
will result in a continuously exiting shell and therefore a barrage of prompts with the shell unusable.This issue is solved if rebooting (hardware reboot) the device, but it renders automatic tests like
examples/suit_update
impossible.Steps to reproduce the issue
Flash an applicationn using
stdio_ethos
, e.g.:examples/suit_update
,examples/gnrc_border_router
.Expected results
Shell boots up fine.
Actual results
Versions