[TRACKING] cpu/esp8266: Open problems of the ESP8266 port based on the ESP8266 RTOS SDK

RIOT-OS / RIOT

RIOT - The friendly OS for IoT

https://riot-os.org

GNU Lesser General Public License v2.1

4.96k stars 1.99k forks source link

[TRACKING] cpu/esp8266: Open problems of the ESP8266 port based on the ESP8266 RTOS SDK #12707

Open gschorcht opened 5 years ago

gschorcht commented 5 years ago

Description

When testing the reimplementation of the ESP866 port based on the ESP8266 RTOS SDK in PR #11108, the following open problems were identified (https://github.com/RIOT-OS/RIOT/pull/11108#issuecomment-552115885):

[ ] Lock functions for newlib

The lock functions in cpu/esp8266/syscalls.c don't work if an application calls a newlib function from an ISR and the newlib uses _lock_acquire or _log_acquire_recursive to be thread-safe. The reason is that the esp8266 implementation of these lock functions uses mutex and rmutex which do not work in the interrupt context. For example, puts is used in tests/isr_yield_higher from an ISR:

make BOARD=esp8266-esp-12x -C tests/isr_yield_higher flash test

Workaround: lock functions are disabled for the moment https://github.com/RIOT-OS/RIOT/blob/6afeea7c99355e74e9f4844f0720680528790d24/cpu/esp8266/syscalls.c#L167 https://github.com/RIOT-OS/RIOT/blob/6afeea7c99355e74e9f4844f0720680528790d24/cpu/esp8266/syscalls.c#L185
[ ] Watchdog timer

The watchdog timer will reset if threads take a long time to complete without yielding and if no interrupts occur. For example, tests/libfixmath is doing time consuming calculation within a loop without yielding.

make BOARD=esp8266-esp-12x -C tests/libfixmath flash test

Workaround: Watchdog timer is disabled at the moment. https://github.com/RIOT-OS/RIOT/blob/5b002b860f473be48f04e956dc8a7a4f9f3ce8ad/cpu/esp8266/vendor/esp-idf/esp8266/source/task_wdt.c#L30
[ ] Compile Option -Os

With -Os, char arrays have not to be 32-bit word aligned. This leads to an alignment exception when the address of a char array is assigned to an uint32_t pointer and the pointer is used for the access. This problem could occur in any part of the code at any time. For example, tests/pkg_qdsa defines an array that is used in cryptographic functions after assigning their addresses to uint32_t pointers:

make BOARD=esp8266-esp-12x -C tests/pkg_qda flash test

Workaround: -O2 is used for optimization. https://github.com/RIOT-OS/RIOT/blob/5b002b860f473be48f04e956dc8a7a4f9f3ce8ad/cpu/esp8266/Makefile.include#L126
[ ] ROM strlen function crashes for NULL pointer.
[ ] Reboot hangs if esp_wifi is associated with an AP until the watchdog timer hard resets the esp8266.
[ ] Module esp_gdbstub doesn't work since it redefines the putchar function provided by the newlib.
[x] Messages from vendor WiFI libraries have to be disabled. These are messages like
```
add if0
add if1
bcn 100
scandone
```
(fixed by PR #12998)
[x] After the merge of PR #10499, esp-now no longer gets a link-local address by default (fixed by PR #13561)

[ ] Erasing a sector in SPI flash hangs if WiFi module esp_wifi is enabled (issue #16281)

CFLAGS=-DCONFIG_USE_HARDWARE_MTD USEMODULE="esp_spiffs" make BOARD=esp8266-esp-12x -C tests/pkg_spiffs flash

Expected results

The test applications should without the workarounds.

Actual results

The test applications hang or crash.

Versions

gschorcht commented 5 years ago

@maribu Even if you are unfamiliar with ESP8266, I appreciate your in-depth knowledge of compiler, optimization, and the NewLib. So I'd be very grateful if you could take a look at the three issues I found while testing the reimplementation of the ESP8266 port. I can provide you with more details if necessary. Finally, I would like to discuss what the best way would be to fix them.

maribu commented 5 years ago

The lock functions in cpu/esp8266/syscalls.c don't work when an application calls a newlib function from an ISR which uses _lock_acquire or _log_acquire_recursive to be thread-safe. The reason is that the Implementation of these lock functions uses mutex and rmutex which do not work in the interrupt context.

Possible solutions and their pros and cons as far as I can see:

Check on each syscall if they are called from interrupt context, and fail if so
- Pros: Safe, simple, and syscalls should be used from there to not sacrifice real time capabilities. (E.g. RIOT does not allow nested interrupts as far as I know.)
- Cons: A significant amount of DEBUG() statements are done from interrupt context, so this is likely to cause huge pain when debugging. If other platforms support syscalls from interrupt context (even though the same issue should apply there!), this will cause confusion
In thread context, lock the mutex as always. In interrupt context, try to lock the mutex (via mutex_trylock()) and let the syscall fail if this doesn't succeed
- Pros: Safe, simple, syscalls will work most of the time from interrupt context
- Cons: Sporadically lost syscalls seems pretty much unacceptable to me
Disable interrupts instead of using a mutex
- Pros: Even more simple, safe, and syscalls work fine regardless of context
- Cons: Likely breaks with all real time requirements, will cause lost interrupts for drivers etc.
Delegate the work to threads (such as those provided by the event handle thread, irq_handler or tasklet), if called from interrupt context
- Pros: Safe, reliable, keeps real time capabilities, can be an optional feature with falling back to letting all syscalls fail in IRQ context if not enabled
- Cons: Increases RAM & ROM footprint, complex

(I have a meeting now, I'll continue later.)

gschorcht commented 5 years ago

The lock functions in cpu/esp8266/syscalls.c don't work when an application calls a newlib function from an ISR which uses _lock_acquire or _log_acquire_recursive to be thread-safe. The reason is that the Implementation of these lock functions uses mutex and rmutex which do not work in the interrupt context.

@maribu Thanks for taking a first look. Unfortunately, a number of these options are not in our hand because puts is realized by the newlib which also calls back the _lock_acquire* functions from the puts function. See my comments inline.

Check on each syscall if they are called from interrupt context, and fail if so

Because puts and all other functions are realized by the newlib we have no chance to do it in that way. The according locking functions tried to solve it with an assert https://github.com/RIOT-OS/RIOT/blob/6afeea7c99355e74e9f4844f0720680528790d24/cpu/esp8266/syscalls.c#L181 which at least points the developer that there is problem. Unfortunately, the assert itself doesn't work because it uses printf from the newlib which leads to the same problem :worried:

In thread context, lock the mutex as always. In interrupt context, try to lock the mutex (via mutex_trylock()) and let the syscall fail if this doesn't succeed

newlib's _lock_acquire* functions have no return value. newlib calls these functions and simply expects that they can't fail. For the case, that the newlib expects that the lock function can fail, are separate _lock_try_acquire* functions which return a value that is checked by the newlib. But in the case of puts and other functions these _lock_try_acquire* functions are not used.

Disable interrupts instead of using a mutex

Saving the correct interrupt state over several calls of these locking function is not an easy task because the newlib uses different lock objects for different tasks and sometimes multiple lock objects in the same function, e.g., the malloc lock and the file lock in printf.

Delegate the work to threads (such as those provided by the event handle thread, irq_handler or tasklet), if called from interrupt context

I don't see how to realize it in that way.

BTW, when I deactivate all locking functions, everything works fine but isn't thread-safe. I bet that most of the RIOT platforms that use the newlib don't realize that locking functions.

benpicco commented 5 years ago

With -Os, char arrays have not to be 32-bit word aligned. This leads to an alignment exception when the address of a char array is assigned to an uint32_t pointer and the pointer is used for the access.

This should also be a problem on ARM then, it doesn't support unaligned access either.

gschorcht commented 5 years ago

With -Os, char arrays have not to be 32-bit word aligned. This leads to an alignment exception when the address of a char array is assigned to an uint32_t pointer and the pointer is used for the access.

This should also be a problem on ARM then, it doesn't support unaligned access either.

Obviously not. I have tested tests/pkg_qdsa on a nucleo-f411re board. The global variable sm https://github.com/RIOT-OS/RIOT/blob/5a705762ee03301f8ba3a854dda0310fd16ed64b/tests/pkg_qdsa/main.c#L30 is placed @ 0x00000000200014e9. This variable is used as parameter for function sign, there for function hash and there for function extractFromState. Finally, it is assigned to an uint32_t * pointer variable in extractFromState for faster copy operation. The unaligned 32-bit access works on an STM32F4, I have debugged it.

maribu commented 5 years ago

Casting a pointer with a low alignment requirement (like uint8_t *) to one with a higher alignment requirement (link uint32_t *) and then accessing memory using the latter pointer is undefined behaviour. And not all platform support these, including some ARM versions.

So this is definitely a bug in pkg_qdsa that needs to be fixed. Once the bug is fixed, no work around is needed.

This by the way again shows that diversity in the platform RIOT supports is inherently valuable, as the bug wouldn't have shown up without.

maribu commented 5 years ago

* ROM `strlen` function crashes for NULL pointer.

I believe this is absolutely fine, as NULL is not pointing to a string and the documentation says the argument has to be a pointer to a string. Some random guys on stackoverflow also agree with this and so does GCC:

~ $ cat test.c
#include <string.h>

int main(void)
{
    strlen(NULL);
    return 0;
}
~ $ gcc -Os -Wall -std=c99 -o test test.c 
test.c: In function 'main':
test.c:5:2: warning: null argument where non-null required (argument 1) [-Wnonnull]
    5 |  strlen(NULL);
      |  ^~~~~~

maribu commented 5 years ago

* Watchdog timer

Could you map this to RIOTs WDT API? If I understand this correctly, the current WDT is transparently handled without the code in RIOT being aware of this. I think this approach simply has its limits and the easiest way would be to just only provide a watchdog timer, if the RIOT application asks for it and is prepared to feed it regularly.

gschorcht commented 5 years ago

* Watchdog timer
Could you map this to RIOTs WDT API?

Thanks for the hint. This API is new, I did not know about it until now. I am not sure if this mapping is possible because I do not know if it is possible to avoid initializing the WDT at all. I have to check.

iosabi commented 3 years ago

Reboot hangs if esp_wifi is associated with an AP until the watchdog timer hard resets the esp8266.

I tried to debug this. This hangs on an assert() in xtimer because long_offset is not 0: https://github.com/RIOT-OS/RIOT/blob/9256970517798bd844f4ee1dc4cddd9f61cef6e1/sys/xtimer/xtimer_core.c#L279

#0  core_panic (crash_code=crash_code@entry=PANIC_ASSERT_FAIL, message=0x3ffe9318 <assert_crash_message> "FAILED ASSERTION.") at core/panic.c:91
#1  0x40103904 in _update_short_timers (now=<optimized out>) at sys/xtimer/xtimer_core.c:279
#2  _timer_callback () at sys/xtimer/xtimer_core.c:323
#3  _periph_timer_callback (arg=<optimized out>, chan=<optimized out>) at sys/xtimer/xtimer_core.c:130
#4  0x40100e4a in hw_timer_handler (arg=<optimized out>) at cpu/esp8266/periph/timer.c:112
#5  0x40108728 in _xt_lowint1 () at cpu/esp_common/vendor/xtensa/xtensa_vectors.S:1317

This stack trace is not relevant. The reason it fails is because the timer points to the main thread's stack (assuming you called "reboot" from the main thread) on a function that added a xtimer_t defined in the stack and then returned. The stack was then re-used for something else and long_offset now contains garbage.

The stack trace at the call is roughly (line numbers not exact because I added some debug to obtain this):

#1  0x40100887 in thread_yield_higher () at cpu/esp_common/thread_arch.c:287
#2  0x40210d44 in _block (irq_state=1, mutex=0x3ffef7cc <main_stack+2460>) at core/mutex.c:67
#3  mutex_lock (mutex=mutex@entry=0x3ffef7cc <main_stack+2460>) at core/mutex.c:85
#4  0x4010404c in _xtimer_tsleep (offset=100000, long_offset=long_offset@entry=0) at sys/xtimer/xtimer.c:68
#5  0x40102c82 in _xtimer_tsleep32 (ticks=<optimized out>) at sys/include/xtimer/implementation.h:162
#6  xtimer_usleep (microseconds=<optimized out>) at sys/include/xtimer/implementation.h:181
#7  vTaskDelay (xTicksToDelay=10) at cpu/esp_common/freertos/task.c:138
#8  0x40211f64 in task_delay_wrapper (tick=<optimized out>) at cpu/esp8266/vendor/esp-idf/esp8266/source/esp_wifi_os_adapter.c:84
#9  0x40239505 in ieee80211_sta_new_state ()
#10 0x40231f7e in esp_wifi_disconnect ()
#11 0x4023a314 in wifi_station_stop ()
#12 0x4023070a in esp_restart ()
#13 0x4021946c in system_restart () at cpu/esp8266/sdk/system.c:45

_xtimer_tsleep does a mutex_lock, adds the timer, then mutex_lock again on the same mutex. When the timer triggers it should unlock the timer. https://github.com/RIOT-OS/RIOT/blob/9256970517798bd844f4ee1dc4cddd9f61cef6e1/sys/xtimer/xtimer.c#L63

However, the second mutex_lock doesn't block and the _xtimer_tsleep returns leaving the stack timer in the pending list. When thread_yield_higher runs for the second call to mutex_lock, irq_is_in() returns 0, the ETS_SOFT_INUM (7) shows as enabled and set in special registers INTENABLE and INTERRUPT in the debugger, but It doesn't seem to fire and the context switch never happens. The last condition would be to check the CINTLEVEL, PS register is 0x21, so PS.INTLEVEL=1 and PS.UM=1 (but PS.EXCM is 0). I don't think we are running from an interrupt here (and irq_is_in() returns 0) but the INTLEVEL is 1 when task_delay_wrapper is called and 0 at the beginning when system_restart is called. esp_wifi_disconnect() disables the interrupts with a rsil instruction so anything that calls esp_wifi_disconnect() will crash this way.

I see a few options:

re-enable interrupts in vTaskDelay if not enabled ( https://github.com/RIOT-OS/RIOT/blob/9256970517798bd844f4ee1dc4cddd9f61cef6e1/cpu/esp_common/freertos/task.c#L133 ) (probably bad idea)
Busy-loop the CPU in vTaskDelay if called when interrupts not enabled.
just skip the xtimer call in vTaskDelay if called when interrupts not enabled.

I don't know what you expect if interrupts are disabled and you call vTaskDelay. @gschorcht WDYT?