esp-rs / esp-idf-svc

Type-Safe Rust Wrappers for various ESP-IDF services (WiFi, Network, Httpd, Logging, etc.)
https://docs.esp-rs.org/esp-idf-svc/
Apache License 2.0

Restricting tasks to CPU0 breaks async tls #296

Open torkleyy opened 11 months ago

torkleyy commented 11 months ago

If I restrict threads to run on CPU0:

CONFIG_ESP_MAIN_TASK_AFFINITY=CPU0
CONFIG_PTHREAD_TASK_CORE_DEFAULT=0
CONFIG_PTHREAD_DEFAULT_CORE_0=y

then the future returned by AsyncEspTls::negotiate never becomes ready.

I would expect the main thread (async_io::block_on) to yield at some point to let the IO thread make progress.
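For context, the failing setup has roughly this shape (a sketch only; negotiate_tls is a hypothetical stand-in for whatever builds an AsyncEspTls and awaits its negotiate):

    fn main() -> anyhow::Result<()> {
        esp_idf_sys::link_patches();

        // With the CPU0-only sdkconfig above, this call never returns:
        // the future produced by AsyncEspTls::negotiate never becomes ready.
        async_io::block_on(negotiate_tls())
    }

    // Hypothetical placeholder for the code that creates an AsyncEspTls
    // instance and awaits AsyncEspTls::negotiate on it.
    async fn negotiate_tls() -> anyhow::Result<()> {
        todo!()
    }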

ivmarkov commented 11 months ago

This might be caused by this terrible hack, which tries to mitigate an oddity in esp-tls: it masks out whether mbedtls wants us to wait for the socket to become readable or writable.

Can you increase the delay from 0 to - say - 20 (= 20 ms), and then remove the await on the socket becoming writable? I'm shooting in the dark here, but if that turns out to be the culprit, I'm at a loss as to how to implement a proper workaround.
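For illustration, the suggested experiment for the EWOULDBLOCK branch would look roughly like this (a sketch only; the actual code lives in esp-idf-svc's TLS wrapper and is structured differently):

    use std::time::Duration;

    // Sketch of the proposed change: instead of a 0 ms delay followed by
    // awaiting the socket becoming writable, just sleep ~20 ms and let the
    // caller retry the mbedtls handshake step.
    async fn on_would_block() {
        async_io::Timer::after(Duration::from_millis(20)).await;
        // (the await on the socket becoming writable is removed for this experiment)
    }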

ivmarkov commented 11 months ago

Additional info: IF the main task had a higher priority than the pthread-based "poll my sockets" task that async-io silently runs in the background, that delay_ms(0) wouldn't help, as the main task yields the rest of its time slice only to tasks of the same or higher priority.

I hope you haven't changed your main task priority. By default it runs at priority 1, which is only higher than the idle tasks, which run at priority 0. pthread tasks, on the other hand, run at priority 5 by default, which is higher.

But if you DID change your main task prio to something higher than 5... that would explain what is going on there...
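One quick way to verify this is to print the calling task's priority, assuming the esp-idf-sys bindings expose the standard FreeRTOS uxTaskPriorityGet call (a null handle queries the current task):

    // Logs the FreeRTOS priority of the task this function is called from.
    // Call it from the main task and from the thread running the async code to
    // compare the two (expected defaults: main task = 1, pthread tasks = 5).
    fn log_current_task_priority(label: &str) {
        let prio = unsafe { esp_idf_sys::uxTaskPriorityGet(core::ptr::null_mut()) };
        log::info!("{label}: running at FreeRTOS priority {prio}");
    }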

torkleyy commented 11 months ago

I cannot currently check anything, but I'm sure my main task priority is the default one.

ivmarkov commented 11 months ago

It will take me a couple of days to check myself, but I guess there's no rush.

torkleyy commented 10 months ago

This is the stack trace when the watchdog interrupts the thread:

0x3FFC0160 - port_IntStack
    at ??:??
0x40083CDD - _xt_lowint1
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel/portable/xtensa\xtensa_vectors.S:1236
0x3FFC0190 - port_IntStack
    at ??:??
0x4021A6AB - esp_vfs_safe_fd_isset
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/vfs\vfs.c:838
0x3FFDEEC0 - _bss_end
    at ??:??
0x4017322E - esp_vfs_select
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/vfs\vfs.c:929
0x3FFDEEE0 - _bss_end
    at ??:??
0x4016F153 - poll
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\poll.c:63
0x3FFDEF50 - _bss_end
    at ??:??
0x40125A04 - polling::poll::poll
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\poll.rs:445
0x3FFDEFA0 - _bss_end
    at ??:??
0x401270F4 - polling::Poller::wait
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\lib.rs:394
0x3FFDF180 - _bss_end
    at ??:??
0x4011DA5C - async_io::reactor::ReactorLock::react
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\reactor.rs:270
0x3FFDF1F0 - _bss_end
    at ??:??
0x4011FD61 - async_io::driver::main_loop
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:65
0x3FFDF2B0 - _bss_end
    at ??:??
0x4011EAB4 - async_io::driver::unparker::{{closure}}::{{closure}}
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:32
0x3FFDF330 - _bss_end
    at ??:??
0x4011F1E7 - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
    at C:\Users\me\.rustup\toolchains\esp\lib\rustlib\src\rust\library\std\src\thread\mod.rs:529
0x3FFDF350 - _bss_end
    at ??:??

I've increased the delay in the EWOULDBLOCK_I32 case. With a delay of 20, the watchdog still triggered, but the stack trace changed:

0x3FFC0160 - port_IntStack
    at ??:??
0x40083CDD - _xt_lowint1
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel/portable/xtensa\xtensa_vectors.S:1236
0x3FFC0190 - port_IntStack
    at ??:??
0x4000BFED - _xtos_set_intlevel
    at ??:??
0x3FFDF6A0 - _heap_start
    at ??:??
0x400943D2 - vPortClearInterruptMaskFromISR
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos\portmacro.h:568
0x3FFDF6B0 - _heap_start
    at ??:??
0x400930EF - xQueueGenericSend
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel\queue.c:882
0x3FFDF6D0 - _heap_start
    at ??:??
0x40083F8A - lock_release_generic
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\locks.c:192
0x3FFDF710 - _heap_start
    at ??:??
0x4008403D - _lock_release
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\locks.c:198
0x3FFDF740 - _heap_start
    at ??:??
0x4017315E - esp_vfs_select
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/vfs\vfs.c:932
0x3FFDF760 - _heap_start
    at ??:??
0x4016F067 - poll
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\poll.c:63
0x3FFDF7D0 - _heap_start
    at ??:??
0x40125918 - polling::poll::poll
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\poll.rs:445
0x3FFDF820 - _heap_start
    at ??:??
0x40127008 - polling::Poller::wait
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\lib.rs:394
0x3FFDFA00 - _heap_start
    at ??:??
0x4011D970 - async_io::reactor::ReactorLock::react
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\reactor.rs:270
0x3FFDFA70 - _heap_start
    at ??:??
0x4011FC75 - async_io::driver::main_loop
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:65
0x3FFDFB30 - _heap_start
    at ??:??
0x4011E9C8 - async_io::driver::unparker::{{closure}}::{{closure}}
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:32
0x3FFDFBB0 - _heap_start
    at ??:??
0x4011F0FB - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
    at C:\Users\me\.rustup\toolchains\esp\lib\rustlib\src\rust\library\std\src\thread\mod.rs:529

Interestingly, I've had one case where it did manage to complete the TLS negotiation (and afterwards communicated normally with the server). However, it took abnormally long (7 seconds), and I wasn't able to reproduce it a second time.

ivmarkov commented 10 months ago

@torkleyy TL;DR: If you wrap your async_io::block_on call in another thread, like this:

    thread::Builder::new().stack_size(15000).spawn(move || {
        smol::block_on(<... your code driving AsyncEspTls ...>)
    })?
    .join()
    .unwrap()

... it would work.

Long story: I still have to get to the exact root cause, but in the meantime I'm pretty sure the problem is caused by the combination of two factors:

- with the configuration above, all tasks are pinned to CPU0, so there is effectively a single core to schedule on;
- the hidden "reactor" task that async-io silently spawns in the background is a pthread task (priority 5) that ends up busy-looping in poll, while the main task driving block_on runs at priority 1.

As a result, both the main task as well as the idle tasks are actually starving and never being scheduled. So - in a way - the EspTls code never gets a chance to run, even though its socket is ready!

The workaround of running your code in a pthread task works because it raises the priority of your task to 5 and lets it share the CPU with the busy-looping async-io hidden task, since ESP-IDF allows time-slicing across tasks of equal priority.
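Put together, the workaround looks roughly like this (a sketch; run_tls_client is a hypothetical stand-in for the code driving AsyncEspTls):

    use std::thread;

    fn main() -> anyhow::Result<()> {
        esp_idf_sys::link_patches();

        // std::thread goes through the ESP-IDF pthread layer, so this thread
        // runs at the default pthread priority (5) rather than the main task's
        // priority (1), and can therefore time-slice with the async-io reactor
        // task, which also runs at priority 5.
        thread::Builder::new()
            .stack_size(15000)
            .spawn(move || smol::block_on(run_tls_client()))?
            .join()
            .unwrap()
    }

    // Hypothetical placeholder for the code that sets up and drives AsyncEspTls.
    async fn run_tls_client() -> anyhow::Result<()> {
        todo!()
    }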

I have a déjà-vu feeling that I'm discovering this for the second time, having experienced it on the single-core esp32c3 a while ago. I think there is a reason why I put all the async code in rust-esp32-std-demo in pthread tasks, and not in the main task. :)

torkleyy commented 10 months ago

Ha! I see. Thanks a bunch for all the information!