Open gschorcht opened 5 years ago
@maribu Even if you are unfamiliar with ESP8266, I appreciate your in-depth knowledge of compiler, optimization, and the NewLib. So I'd be very grateful if you could take a look at the three issues I found while testing the reimplementation of the ESP8266 port. I can provide you with more details if necessary. Finally, I would like to discuss what the best way would be to fix them.
- The lock functions in
cpu/esp8266/syscalls.c
don't work when an application calls anewlib
function from an ISR which uses_lock_acquire
or_log_acquire_recursive
to be thread-safe. The reason is that the Implementation of these lock functions usesmutex
andrmutex
which do not work in the interrupt context.
Possible solutions and their pros and cons as far as I can see:
DEBUG()
statements are done from interrupt context, so this is likely to cause huge pain when debugging. If other platforms support syscalls from interrupt context (even though the same issue should apply there!), this will cause confusionmutex_trylock()
) and let the syscall fail if this doesn't succeed
irq_handler
or tasklet), if called from interrupt context
(I have a meeting now, I'll continue later.)
- The lock functions in
cpu/esp8266/syscalls.c
don't work when an application calls anewlib
function from an ISR which uses_lock_acquire
or_log_acquire_recursive
to be thread-safe. The reason is that the Implementation of these lock functions usesmutex
andrmutex
which do not work in the interrupt context.
@maribu Thanks for taking a first look. Unfortunately, a number of these options are not in our hand because puts
is realized by the newlib
which also calls back the _lock_acquire*
functions from the puts
function. See my comments inline.
- Check on each syscall if they are called from interrupt context, and fail if so
Because puts
and all other functions are realized by the newlib
we have no chance to do it in that way. The according locking functions tried to solve it with an assert https://github.com/RIOT-OS/RIOT/blob/6afeea7c99355e74e9f4844f0720680528790d24/cpu/esp8266/syscalls.c#L181 which at least points the developer that there is problem. Unfortunately, the assert itself doesn't work because it uses printf
from the newlib
which leads to the same problem :worried:
- In thread context, lock the mutex as always. In interrupt context, try to lock the mutex (via
mutex_trylock()
) and let the syscall fail if this doesn't succeed
newlib
's _lock_acquire*
functions have no return value. newlib
calls these functions and simply expects that they can't fail. For the case, that the newlib
expects that the lock function can fail, are separate _lock_try_acquire*
functions which return a value that is checked by the newlib
. But in the case of puts
and other functions these _lock_try_acquire*
functions are not used.
- Disable interrupts instead of using a mutex
Saving the correct interrupt state over several calls of these locking function is not an easy task because the newlib
uses different lock
objects for different tasks and sometimes multiple lock
objects in the same function, e.g., the malloc
lock and the file lock in printf
.
- Delegate the work to threads (such as those provided by the event handle thread,
irq_handler
or tasklet), if called from interrupt context
I don't see how to realize it in that way.
BTW, when I deactivate all locking functions, everything works fine but isn't thread-safe. I bet that most of the RIOT platforms that use the newlib
don't realize that locking functions.
With
-Os
, char arrays have not to be 32-bit word aligned. This leads to an alignment exception when the address of a char array is assigned to anuint32_t
pointer and the pointer is used for the access.
This should also be a problem on ARM then, it doesn't support unaligned access either.
With
-Os
, char arrays have not to be 32-bit word aligned. This leads to an alignment exception when the address of a char array is assigned to anuint32_t
pointer and the pointer is used for the access.This should also be a problem on ARM then, it doesn't support unaligned access either.
Obviously not. I have tested tests/pkg_qdsa
on a nucleo-f411re
board. The global variable sm
https://github.com/RIOT-OS/RIOT/blob/5a705762ee03301f8ba3a854dda0310fd16ed64b/tests/pkg_qdsa/main.c#L30 is placed @ 0x00000000200014e9
. This variable is used as parameter for function sign
, there for function hash
and there for function extractFromState
. Finally, it is assigned to an uint32_t *
pointer variable in extractFromState
for faster copy operation. The unaligned 32-bit access works on an STM32F4, I have debugged it.
Casting a pointer with a low alignment requirement (like uint8_t *
) to one with a higher alignment requirement (link uint32_t *
) and then accessing memory using the latter pointer is undefined behaviour. And not all platform support these, including some ARM versions.
So this is definitely a bug in pkg_qdsa that needs to be fixed. Once the bug is fixed, no work around is needed.
This by the way again shows that diversity in the platform RIOT supports is inherently valuable, as the bug wouldn't have shown up without.
* ROM `strlen` function crashes for NULL pointer.
I believe this is absolutely fine, as NULL
is not pointing to a string and the documentation says the argument has to be a pointer to a string. Some random guys on stackoverflow also agree with this and so does GCC:
~ $ cat test.c
#include <string.h>
int main(void)
{
strlen(NULL);
return 0;
}
~ $ gcc -Os -Wall -std=c99 -o test test.c
test.c: In function 'main':
test.c:5:2: warning: null argument where non-null required (argument 1) [-Wnonnull]
5 | strlen(NULL);
| ^~~~~~
* Watchdog timer
Could you map this to RIOTs WDT API? If I understand this correctly, the current WDT is transparently handled without the code in RIOT being aware of this. I think this approach simply has its limits and the easiest way would be to just only provide a watchdog timer, if the RIOT application asks for it and is prepared to feed it regularly.
* Watchdog timer
Could you map this to RIOTs WDT API?
Thanks for the hint. This API is new, I did not know about it until now. I am not sure if this mapping is possible because I do not know if it is possible to avoid initializing the WDT at all. I have to check.
Reboot hangs if esp_wifi is associated with an AP until the watchdog timer hard resets the esp8266.
I tried to debug this. This hangs on an assert() in xtimer because long_offset is not 0: https://github.com/RIOT-OS/RIOT/blob/9256970517798bd844f4ee1dc4cddd9f61cef6e1/sys/xtimer/xtimer_core.c#L279
#0 core_panic (crash_code=crash_code@entry=PANIC_ASSERT_FAIL, message=0x3ffe9318 <assert_crash_message> "FAILED ASSERTION.") at core/panic.c:91
#1 0x40103904 in _update_short_timers (now=<optimized out>) at sys/xtimer/xtimer_core.c:279
#2 _timer_callback () at sys/xtimer/xtimer_core.c:323
#3 _periph_timer_callback (arg=<optimized out>, chan=<optimized out>) at sys/xtimer/xtimer_core.c:130
#4 0x40100e4a in hw_timer_handler (arg=<optimized out>) at cpu/esp8266/periph/timer.c:112
#5 0x40108728 in _xt_lowint1 () at cpu/esp_common/vendor/xtensa/xtensa_vectors.S:1317
This stack trace is not relevant. The reason it fails is because the timer
points to the main thread's stack (assuming you called "reboot" from the main thread) on a function that added a xtimer_t defined in the stack and then returned. The stack was then re-used for something else and long_offset now contains garbage.
The stack trace at the call is roughly (line numbers not exact because I added some debug to obtain this):
#1 0x40100887 in thread_yield_higher () at cpu/esp_common/thread_arch.c:287
#2 0x40210d44 in _block (irq_state=1, mutex=0x3ffef7cc <main_stack+2460>) at core/mutex.c:67
#3 mutex_lock (mutex=mutex@entry=0x3ffef7cc <main_stack+2460>) at core/mutex.c:85
#4 0x4010404c in _xtimer_tsleep (offset=100000, long_offset=long_offset@entry=0) at sys/xtimer/xtimer.c:68
#5 0x40102c82 in _xtimer_tsleep32 (ticks=<optimized out>) at sys/include/xtimer/implementation.h:162
#6 xtimer_usleep (microseconds=<optimized out>) at sys/include/xtimer/implementation.h:181
#7 vTaskDelay (xTicksToDelay=10) at cpu/esp_common/freertos/task.c:138
#8 0x40211f64 in task_delay_wrapper (tick=<optimized out>) at cpu/esp8266/vendor/esp-idf/esp8266/source/esp_wifi_os_adapter.c:84
#9 0x40239505 in ieee80211_sta_new_state ()
#10 0x40231f7e in esp_wifi_disconnect ()
#11 0x4023a314 in wifi_station_stop ()
#12 0x4023070a in esp_restart ()
#13 0x4021946c in system_restart () at cpu/esp8266/sdk/system.c:45
_xtimer_tsleep
does a mutex_lock, adds the timer, then mutex_lock again on the same mutex. When the timer triggers it should unlock the timer.
https://github.com/RIOT-OS/RIOT/blob/9256970517798bd844f4ee1dc4cddd9f61cef6e1/sys/xtimer/xtimer.c#L63
However, the second mutex_lock doesn't block and the _xtimer_tsleep
returns leaving the stack timer
in the pending list.
When thread_yield_higher
runs for the second call to mutex_lock, irq_is_in()
returns 0, the ETS_SOFT_INUM (7) shows as enabled and set in special registers INTENABLE and INTERRUPT in the debugger, but It doesn't seem to fire and the context switch never happens. The last condition would be to check the CINTLEVEL, PS register is 0x21, so PS.INTLEVEL=1 and PS.UM=1 (but PS.EXCM is 0). I don't think we are running from an interrupt here (and irq_is_in()
returns 0) but the INTLEVEL is 1 when task_delay_wrapper is called and 0 at the beginning when system_restart is called. esp_wifi_disconnect() disables the interrupts with a rsil
instruction so anything that calls esp_wifi_disconnect()
will crash this way.
I see a few options:
vTaskDelay
if not enabled (
https://github.com/RIOT-OS/RIOT/blob/9256970517798bd844f4ee1dc4cddd9f61cef6e1/cpu/esp_common/freertos/task.c#L133 ) (probably bad idea)I don't know what you expect if interrupts are disabled and you call vTaskDelay. @gschorcht WDYT?
Description
When testing the reimplementation of the ESP866 port based on the ESP8266 RTOS SDK in PR #11108, the following open problems were identified (https://github.com/RIOT-OS/RIOT/pull/11108#issuecomment-552115885):
[ ] Lock functions for newlib
The lock functions in
cpu/esp8266/syscalls.c
don't work if an application calls anewlib
function from an ISR and thenewlib
uses_lock_acquire
or_log_acquire_recursive
to be thread-safe. The reason is that the esp8266 implementation of these lock functions usesmutex
andrmutex
which do not work in the interrupt context. For example,puts
is used intests/isr_yield_higher
from an ISR:make BOARD=esp8266-esp-12x -C tests/isr_yield_higher flash test
Workaround: lock functions are disabled for the moment https://github.com/RIOT-OS/RIOT/blob/6afeea7c99355e74e9f4844f0720680528790d24/cpu/esp8266/syscalls.c#L167 https://github.com/RIOT-OS/RIOT/blob/6afeea7c99355e74e9f4844f0720680528790d24/cpu/esp8266/syscalls.c#L185
[ ] Watchdog timer
The watchdog timer will reset if threads take a long time to complete without yielding and if no interrupts occur. For example,
tests/libfixmath
is doing time consuming calculation within a loop without yielding.make BOARD=esp8266-esp-12x -C tests/libfixmath flash test
Workaround: Watchdog timer is disabled at the moment. https://github.com/RIOT-OS/RIOT/blob/5b002b860f473be48f04e956dc8a7a4f9f3ce8ad/cpu/esp8266/vendor/esp-idf/esp8266/source/task_wdt.c#L30
[ ] Compile Option
-Os
With
-Os
, char arrays have not to be 32-bit word aligned. This leads to an alignment exception when the address of a char array is assigned to anuint32_t
pointer and the pointer is used for the access. This problem could occur in any part of the code at any time. For example,tests/pkg_qdsa
defines an array that is used in cryptographic functions after assigning their addresses touint32_t
pointers:make BOARD=esp8266-esp-12x -C tests/pkg_qda flash test
Workaround:
-O2
is used for optimization. https://github.com/RIOT-OS/RIOT/blob/5b002b860f473be48f04e956dc8a7a4f9f3ce8ad/cpu/esp8266/Makefile.include#L126[ ] ROM
strlen
function crashes for NULL pointer.[ ] Reboot hangs if
esp_wifi
is associated with an AP until the watchdog timer hard resets the esp8266.[ ] Module
esp_gdbstub
doesn't work since it redefines theputchar
function provided by the newlib.[x] Messages from vendor WiFI libraries have to be disabled. These are messages like
(fixed by PR #12998)
[x] After the merge of PR #10499,
esp-now
no longer gets a link-local address by default (fixed by PR #13561)[ ] Erasing a sector in SPI flash hangs if WiFi module
esp_wifi
is enabled (issue #16281)Expected results
The test applications should without the workarounds.
Actual results
The test applications hang or crash.
Versions