Closed mchesser closed 9 months ago
Thanks for reporting this. In https://github.com/RIOT-OS/RIOT/pull/8619 it is mentioned that the nano variant of newlib behaves differently. Would you mind checking which version is used? Can you reproduce the issue in both cases? (`BUILD_IN_DOCKER=1` will use a toolchain that has both the `nano` and `non-nano` variants of newlib.)
(Note that RIOT will by default use the nano version, but only if that version is available.)
In a different case, where racy code resulted not in garbled stdio (there seems to be a consensus that this is acceptable in most use cases) but in crashes, we opted to make this thread-safe no matter what. (This was for memory allocation.) So the consistent thing here would be to make sure that `__sinit` is called before `main()`.
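A minimal host-side sketch of that idea, under stated assumptions: `eager_sinit`, `eager_puts`, and `eager_boot` are hypothetical names that only model the control flow; they are not RIOT or newlib functions. The point is that the one-time setup runs on the boot path whether or not the boot message is printed, so no user thread can ever be the first caller.

```c
#include <stdbool.h>
#include <stdio.h>

static bool eager_initialized; /* models "stdio shared data is set up" */
static int eager_init_calls;   /* counts how often the setup body ran */

/* Stand-in for the one-time stdio setup; safe to call repeatedly. */
static void eager_sinit(void)
{
    if (eager_initialized) {
        return;
    }
    eager_init_calls++;
    eager_initialized = true;
}

/* Models a print call: the first caller triggers the setup lazily. */
static int eager_puts(const char *s)
{
    eager_sinit();
    return puts(s);
}

/* Hypothetical boot path: runs before any user thread exists. */
static void eager_boot(bool skip_boot_msg, void (*user_main)(void))
{
    eager_sinit(); /* explicit setup, independent of the boot message */
    if (!skip_boot_msg) {
        eager_puts("This is RIOT");
    }
    if (user_main != NULL) {
        user_main();
    }
}
```

With `eager_boot(true, ...)` the setup still runs exactly once before user code, which is the behaviour the boot message currently provides only as a side effect.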
It would also be nice to provide at least the option to have newlib thread-safe even for functions that do not crash when racing. Some people do like e.g. clean stdio output enough to spend resources on that :)
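One way to picture such an option is a set of lock hooks that a port can either fill in or leave empty. The sketch below is a host-side model only: `struct lock_ops`, the `lock_*` helpers, and the counters are hypothetical and do not correspond to newlib's actual lock interface. The counters stand in for a real mutex so that the behaviour can be checked without threads.

```c
#include <stddef.h>

/* Hypothetical lock hooks; a port would map these to its own mutex. */
struct lock_ops {
    void (*acquire)(void);
    void (*release)(void);
};

static int lock_depth;         /* > 0 while the modelled lock is held */
static int lock_acquire_count; /* number of acquire calls seen */

static void lock_acquire(void) { lock_depth++; lock_acquire_count++; }
static void lock_release(void) { lock_depth--; }

/* Thread-safe configuration: hooks are provided. */
static const struct lock_ops lock_enabled = { lock_acquire, lock_release };
/* Resource-saving configuration: no locking at all. */
static const struct lock_ops lock_disabled = { NULL, NULL };

static int lock_initialized;
static int lock_init_calls;
static int lock_init_was_locked; /* was the lock held during setup? */

/* Lazy one-time setup, guarded by whatever hooks the port supplies. */
static void lock_lazy_init(const struct lock_ops *ops)
{
    if (ops->acquire != NULL) {
        ops->acquire();
    }
    if (!lock_initialized) {
        lock_init_was_locked = (lock_depth > 0);
        lock_init_calls++;
        lock_initialized = 1;
    }
    if (ops->release != NULL) {
        ops->release();
    }
}
```

With `lock_disabled`, the setup body runs without any protection, which models the platforms where a second thread can observe a half-initialized structure.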
Took me a while to get back to this, but I've managed to reproduce with a docker based build of the target. Based on the output of `test/sys/libc_newlib` and the build script output (see below), this build appears to be using `newlib-nano`.
It ended up being quite difficult to get a consistent crash when attached to a debugger. I think this is partly because some of the data allocated as part of stdio initialization is within 'uninitialized' RAM, which happens to be initialized with data from any previous execution.
The method I ended up with was to try to force the firmware to call a null pointer in `worker2` by carefully adjusting the delay to ensure that preemption occurs after zeroing the `stdout` structure and some initial setup, but before the function pointer for `write` is initialized. Checks on the `stdout` FILE structure that occur as part of `puts` (e.g., `SWR` must be set in `flags`) mean that preemption needs to occur within this region of code: findfp.c:60-67.
The null call then occurs as part of `worker2->puts->puts_r->__swbuf_r->_fflush_r->__sflush_r` at flush.c:195.
Getting this exact crash required sub-microsecond adjustment, which I did via a busy loop. However, I think other uses of `stdio` may result in a larger window for the crash.
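To make the window concrete, here is a minimal host-side model of the sequence described above. It is not newlib's code: `struct win_file`, `WIN_SWR`, the step enum, and the return codes are hypothetical stand-ins, and the real `FILE` layout and flag values differ. Initialization is split into steps so that a preemption can be simulated by stopping between them.

```c
#include <stddef.h>

#define WIN_SWR 0x0008 /* illustrative "open for writing" flag value */

typedef int (*win_write_fn)(const char *buf, size_t len);

/* Simplified stand-in for the stdout FILE structure. */
struct win_file {
    int flags;
    win_write_fn write;
};

static int win_write(const char *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

/* Points at which the initializing thread could be preempted. */
enum win_step { WIN_ZEROED = 1, WIN_FLAGS_SET, WIN_DONE };

/* Runs initialization up to and including the given step. */
static void win_init_until(struct win_file *fp, enum win_step stop_after)
{
    if (stop_after >= WIN_ZEROED) {
        fp->flags = 0;
        fp->write = NULL;
    }
    if (stop_after >= WIN_FLAGS_SET) {
        fp->flags = WIN_SWR;
    }
    if (stop_after >= WIN_DONE) {
        fp->write = win_write;
    }
}

/*
 * Models the second thread's print call.
 * Returns the byte count on success, -1 if the flag check rejects the
 * stream, and -2 where real code would jump through a null pointer.
 */
static int win_puts(const struct win_file *fp, const char *s, size_t len)
{
    if (!(fp->flags & WIN_SWR)) {
        return -1;
    }
    if (fp->write == NULL) {
        return -2;
    }
    return fp->write(s, len);
}
```

Only the middle state (`WIN_FLAGS_SET`) reaches the null call; stopping earlier is caught by the flag check, which is why the window in the model is so narrow.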
Although `sinit` has changed slightly in newer versions of newlib because of some fairly recent patches (starting from: https://sourceware.org/pipermail/newlib/2022/019283.html), I'm fairly sure the same problem still occurs (e.g., by causing the switch in a similar location).
After spending some time analyzing this issue, I believe that the actual cause is already known -- i.e., #4488 / #8619-comment-569952641 (and is actually documented in the release notes).
Feel free to close as a duplicate, but I thought I would leave the analysis below regarding the impact of `CONFIG_SKIP_BOOT_MSG`.

Description
The reentrant stdio functions in `newlib` initialize some shared data (via the function `__sinit`) the first time an IO function, e.g., `puts`, is used. Normally, before `main`, RIOT prints a message using `puts` (i.e., "This is RIOT" + version), which ends up calling `__sinit` before user code is executed, avoiding most of the issues with initialization.

However, if `CONFIG_SKIP_BOOT_MSG` is set, `__sinit` will not be called until the first print statement. If two threads attempt to print at roughly the same time, then it is possible[1] for the second thread to execute with a partially initialized `reent` object, which causes various crashes depending on how much of the structure has been initialized.

[1] Platforms that do not perform any locking as part of `_lock_acquire` (see: #8619-comment-569952641).

As an example, if we have code configured like this:
That spawns multiple threads that print -- e.g.:
main.c
```cpp
#include
```

Then if `worker1` is preempted in the middle of stdio initialization, e.g. at:

Then `worker2` may crash when `puts` is called: