esp-rs / esp-idf-svc

Type-Safe Rust Wrappers for various ESP-IDF services (WiFi, Network, Httpd, Logging, etc.)
https://docs.esp-rs.org/esp-idf-svc/
Apache License 2.0

EspHttpWsDetachedSender::send() can fail in Condvar::notify_all() due to a race condition and dereference a null pointer at the condition variable #341

Closed: mike1703 closed this issue 10 months ago

mike1703 commented 11 months ago

I have an HTTP server running on my ESP32 with a websocket endpoint created with ws_handler(). The handler stores the result of ws_connection.create_detached_sender() in my app state so that I can regularly send some data with:

detached_sender.send(FrameType::Text(false), data.as_bytes())
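
Roughly, the whole setup looks like this (a simplified sketch, not my actual code: app state, data source and error handling are omitted, and the exact imports and handler signature may differ between esp-idf-svc versions):

use std::sync::{Arc, Mutex};

use embedded_svc::ws::FrameType;
use esp_idf_svc::http::server::{ws::EspHttpWsDetachedSender, Configuration, EspHttpServer};
use esp_idf_sys::EspError;

// shared slot so a worker thread can pick up the detached sender later
let sender_slot: Arc<Mutex<Option<EspHttpWsDetachedSender>>> = Arc::new(Mutex::new(None));

let mut server = EspHttpServer::new(&Configuration::default())?;

let slot = sender_slot.clone();
server.ws_handler("/data", move |ws_connection| {
    // store a detached sender in the app state when a client connects
    *slot.lock().unwrap() = Some(ws_connection.create_detached_sender()?);
    Ok::<(), EspError>(())
})?;

// elsewhere, a worker thread regularly pushes data through the stored sender
// (`data` comes from the rest of the application)
if let Some(detached_sender) = sender_slot.lock().unwrap().as_mut() {
    // this is the call that sometimes crashes
    detached_sender.send(FrameType::Text(false), data.as_bytes())?;
}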

Sometimes it fails with:

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.                                                                             

Core  0 register dump:                                                                                                                                         
PC      : 0x4008436f  PS      : 0x00060d30  A0      : 0x800844f0  A1      : 0x3ffe2200                                                                         
A2      : 0x00000000  A3      : 0xffffffff  A4      : 0x00000004  A5      : 0x3ffdc98c                                                                         
A6      : 0x00000001  A7      : 0xb33fffff  A8      : 0x00000000  A9      : 0x00000000                     
A10     : 0x00000001  A11     : 0x00000000  A12     : 0x8008d27e  A13     : 0x3ffd3430                                                                         
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x00000013  EXCCAUSE: 0x0000001c                                                                         
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff                                                                         

Backtrace: 0x4008436c:0x3ffe2200 0x400844ed:0x3ffe2230 0x4014f525:0x3ffe2250 0x40135dd3:0x3ffe2270 0x4017bbf1:0x3ffe2290 0x4017c3c2:0x3ffe22d0 0x4017c4a2:0x3ffe2320 0x4008cf95:0x3ffe2340

0x4008436c: lock_acquire_generic at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:111
0x400844ed: _lock_acquire_recursive at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:158
0x4014f525: pthread_cond_broadcast at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/pthread/pthread_cond_var.c:82
0x40135dd3: esp_idf_svc::private::mutex::RawCondvar::notify_all at /home/esp/.cargo/registry/src/index.crates.io-6f17d22bba15001f/esp-idf-svc-0.47.3/src/private/mutex.rs:107
 (inlined by) <esp_idf_svc::private::mutex::RawCondvar as embedded_svc::utils::mutex::RawCondvar>::notify_all at /home/esp/.cargo/registry/src/index.crates.io-6f17d22bba15001f/esp-idf-svc-0.47.3/src/private/mutex.rs:148
 (inlined by) embedded_svc::utils::mutex::Condvar<V>::notify_all at /home/esp/.cargo/registry/src/index.crates.io-6f17d22bba15001f/embedded-svc-0.26.4/src/utils/mutex.rs:176
 (inlined by) esp_idf_svc::http::server::ws::EspHttpWsDetachedSender::enqueue at /home/esp/.cargo/registry/src/index.crates.io-6f17d22bba15001f/esp-idf-svc-0.47.3/src/http/server.rs:1250
0x4017bbf1: httpd_process_ctrl_msg at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:224 (discriminator 9)
0x4017c3c2: httpd_server at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:292 (discriminator 9)
0x4017c4a2: httpd_thread at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:327 (discriminator 15)
0x4008cf95: vPortTaskWrapper at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162

The code at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:111 is:

111: static int IRAM_ATTR lock_acquire_generic(_lock_t *lock, uint32_t delay, uint8_t mutex_type) {
112:    SemaphoreHandle_t h = (SemaphoreHandle_t)(*lock);

The resulting coredump provides the following information:

backtrace full
#0  lock_acquire_generic (lock=0x0, delay=4294967295, mutex_type=4 '\004') at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:112
        h = <optimized out>
        __func__ = "lock_acquire_generic"
        success = <optimized out>
#1  0x400844f0 in _lock_acquire_recursive (lock=0x0) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:158
No locals.
#2  0x4014f528 in pthread_cond_broadcast (cv=0x3ffd3304) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/pthread/pthread_cond_var.c:82
        res = 0
        cond = 0x0
        entry = <optimized out>
#3  0x40135dd6 in esp_idf_svc::private::mutex::RawCondvar::notify_all (self=0x0) at src/private/mutex.rs:107
        r = <optimized out>
        left_val = <optimized out>
        right_val = <optimized out>
        kind = <optimized out>
#4  esp_idf_svc::private::mutex::{impl#10}::notify_all (self=0x0) at src/private/mutex.rs:148
No locals.
#5  embedded_svc::utils::mutex::Condvar<esp_idf_svc::private::mutex::RawCondvar>::notify_all<esp_idf_svc::private::mutex::RawCondvar> (self=0x0) at /home/esp/.cargo/registry/src/index.crates.io-6f17d22bba15001f/embedded-svc-0.26.4/src/utils/mutex.rs:176
No locals.
#6  esp_idf_svc::http::server::ws::EspHttpWsDetachedSender::enqueue (arg=0x3ffd32e8) at src/http/server.rs:1250
        ret = 0
        request = 0x3ffd32e8
#7  0x4017bbf4 in httpd_process_ctrl_msg (hd=<optimized out>) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:224
        msg = {hc_msg = HTTPD_CTRL_WORK, hc_work = 0x40135d90 <esp_idf_svc::http::server::ws::EspHttpWsDetachedSender::enqueue>, hc_work_arg = 0x3ffd32e8}
        ret = <optimized out>
        __func__ = "httpd_process_ctrl_msg"
#8  0x4017c3c5 in httpd_server (hd=0x3ffde298) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:292
        read_set = {__fds_bits = {0, 16777216}}
        tmp_max_fd = 58
        maxfd = <optimized out>
        __func__ = "httpd_server"
        active_cnt = <optimized out>
        context = {fdset = 0x3ffe22e0, hd = 0x3ffde298}
#9  0x4017c4a5 in httpd_thread (arg=0x3ffde298) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:327
        ret = <optimized out>
        hd = 0x3ffde298
        __func__ = "httpd_thread"
#10 0x4008cf98 in vPortTaskWrapper (pxCode=0x4017c47c <httpd_thread>, pvParameters=0x3ffde298) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
        pcTaskName = <optimized out>

What I noticed is that the Condvar always seems to be 0x0 once it crashes, so pthread_cond_broadcast() gets fed a pointer to 0x0 and lock_acquire_generic() ends up dereferencing 0x0 (hence LoadProhibited). I am still trying to create a minimal example that shows this behavior, but I have not yet been able to reproduce the error that way; in my real project it often happens only after a few hours or a day. I have two threads reading from and writing to two UARTs, and several other threads doing other things that provide data to a central component, which sends data via this detached sender. I am not sure how these other threads could interfere.

The EspHttpWsDetachedSender::send() function seems to create and drop the Condvar, and EspHttpWsDetachedSender::enqueue() uses it in the httpd thread in between, so it should not be invalid.

I don't understand why there can be a dereference of 0x0, but to me it looks like there is a bug lurking somewhere in esp-idf-svc.

mike1703 commented 10 months ago

I changed my release build with:

[profile.release]
opt-level = 1
debug-assertions = true

I managed to get another crash with the same behavior but a different trace.

#0  0x400831f9 in panic_abort (details=0x3ffe208b "assert failed: xQueueSemaphoreTake queue.c:1657 (pxQueue->uxItemSize == 0)") at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_system/panic.c:452
No locals.
#1  0x4008a040 in esp_system_abort (details=0x3ffe208b "assert failed: xQueueSemaphoreTake queue.c:1657 (pxQueue->uxItemSize == 0)") at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_system/port/esp_system_chip.c:84
No locals.
#2  0x400920dc in __assert_func (file=0x3f41cc06 "queue.c", line=<optimized out>, func=<optimized out>, expr=0x3f41cfa0 "pxQueue->uxItemSize == 0") at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/assert.c:81
        addr = "\000\000\000\000\000\000\000\000\000\000"
        buff = "assert failed: xQueueSemaphoreTake queue.c:1657 (pxQueue->uxItemSize == 0)\000\000\000\000\000\000\000\f\000\000\000#\v\006\000 !\376?@\276\027@p!\376?#\v\006\000 !\376?@\276\027@p!\376?#\v\006\000|\264\374?\001\000\000\000\377\377?\263 ?\b@\360\277\000@0\005\006\000\205\303\b\200\320!\376?\000\000\000\000#\005\006\000 \005\006\000#\005\006\000\377\377?\263\377\377?\263\001\000\000\000\004\000\000\000#\005\006", '\000' <repeats 11 times>
        lbuf = "1657"
        rem_len = 125
        off = <optimized out>
        str = {0x3ffb5150 "assert failed: ", 0x3f41d13c <__func__$9> "xQueueSemaphoreTake", 0x3ffb59ac " ", 0x3f41cc06 "queue.c", 0x3ffb5160 ":", 0x3ffe2153 "1657", 0x3ffb5164 " (", 0x3f41cfa0 "pxQueue->uxItemSize == 0", 0x3ffb5038 ")"}
#3  0x4008aa09 in xQueueSemaphoreTake (xQueue=0x3ffdcd38, xTicksToWait=<optimized out>) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/queue.c:1657
        xEntryTimeSet = 0
        xTimeOut = {xOverflowCount = 0, xTimeOnEntering = 0}
        pxQueue = 0x3ffdcd38
        xInheritanceOccurred = 0
        __func__ = "xQueueSemaphoreTake"
#4  0x4008ab27 in xQueueTakeMutexRecursive (xMutex=0x3ffdcd38, xTicksToWait=4294967295) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/queue.c:783
        xReturn = <optimized out>
        pxMutex = 0x3ffdcd38
        __func__ = "xQueueTakeMutexRecursive"
#5  0x400843e8 in lock_acquire_generic (lock=0x3ffdc6fc, delay=4294967295, mutex_type=4 '\004') at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:144
        h = 0x3ffdcd38
        __func__ = "lock_acquire_generic"
        success = <optimized out>
#6  0x400844f0 in _lock_acquire_recursive (lock=0x3ffdc6fc) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:158
No locals.
#7  0x40159d14 in pthread_cond_broadcast (cv=0x3ffd33b4) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/pthread/pthread_cond_var.c:82
        res = 0
        cond = 0x3ffdc6fc
        entry = <optimized out>
#8  0x4013bec0 in esp_idf_svc::private::mutex::RawCondvar::notify_all (self=0x0) at src/private/mutex.rs:107
No locals.
#9  esp_idf_svc::private::mutex::{impl#10}::notify_all (self=0x0) at src/private/mutex.rs:148
No locals.
#10 embedded_svc::utils::mutex::Condvar<esp_idf_svc::private::mutex::RawCondvar>::notify_all<esp_idf_svc::private::mutex::RawCondvar> (self=0x0) at /home/esp/.cargo/registry/src/index.crates.io-6f17d22bba15001f/embedded-svc-0.26.4/src/utils/mutex.rs:176
No locals.
#11 esp_idf_svc::http::server::ws::EspHttpWsDetachedSender::enqueue (arg=0x3ffd3398) at src/http/server.rs:1250
        ret = 0
        request = 0x3ffd3398
#12 0x40183884 in httpd_process_ctrl_msg (hd=<optimized out>) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:224
        msg = {hc_msg = HTTPD_CTRL_WORK, hc_work = 0x4013be74 <esp_idf_svc::http::server::ws::EspHttpWsDetachedSender::enqueue>, hc_work_arg = 0x3ffd3398}
        ret = <optimized out>
        __func__ = "httpd_process_ctrl_msg"
#13 0x40184055 in httpd_server (hd=0x3ffde354) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:292
        read_set = {__fds_bits = {0, 16777216}}
        tmp_max_fd = 55
        maxfd = <optimized out>
        __func__ = "httpd_server"
        active_cnt = <optimized out>
        context = {fdset = 0x3ffe22e0, hd = 0x3ffde354}
#14 0x40184135 in httpd_thread (arg=0x3ffde354) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:327
        ret = <optimized out>
        hd = 0x3ffde354
        __func__ = "httpd_thread"
#15 0x4008cf98 in vPortTaskWrapper (pxCode=0x4018410c <httpd_thread>, pvParameters=0x3ffde354) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
        pcTaskName = <optimized out>

mike1703 commented 10 months ago

I created a minimal example that does not require changing too much of the original esp-idf-svc code.

The minimal example can be found here: https://github.com/mike1703/rust-esp-debug/blob/main/src/main.rs. I tried to make the error reproducible much more quickly than with the unmodified esp-idf-svc, and I achieved the same error much faster with this small change: https://github.com/esp-rs/esp-idf-svc/compare/master...mike1703:esp-idf-svc:debug

https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/private/mutex.rs#L106-L109

    pub fn notify_all(&self) {
+       info!("notify_all()");
        let r = unsafe { pthread_cond_broadcast(self.0.get()) };
        debug_assert_eq!(r, 0);
    }

This adds logging when the Condvar::notify_all() call is made. It probably just increases the time spent before the raw pointer is actually dereferenced in pthread_cond_broadcast(), and in that window the main thread presumably somehow deinitializes the pointer contained in the Condvar.

This is the part of the code where I think the race condition happens: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1196-L1251

I still don't see exactly how this is failing, but I hope this example is enough to give someone else a chance to see this error.

With the logging active, the error can be reproduced very easily within ~30-60 seconds on my ESP32 after connecting with $ /usr/bin/python -m websockets ws://<esp_hostname>/data.

Without the added logging it can take hours or a day to be "lucky" and reproduce this issue. My actual appliance runs into it 2-4 times a day, sometimes more often, sometimes less.
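
The sending side boils down to a tight loop along these lines (illustrative sketch only, reusing the hypothetical sender_slot from my first comment; the exact code is in the linked main.rs):

// tight send loop; hammering send() makes the race window easy to hit
loop {
    if let Some(sender) = sender_slot.lock().unwrap().as_mut() {
        let _ = sender.send(FrameType::Text(false), b"tick");
    }
    std::thread::sleep(std::time::Duration::from_millis(10));
}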

mike1703 commented 10 months ago

Hmm, it could be very unlucky timing in this code section, as follows:

This initializes the error_code (a Mutex) with None; it is kept in this function and also passed into the httpd thread: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1208

enqueue() then sets error_code to Some(ret) to signal that it is done and will call notify_all() next: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1249

And here, with very unlucky timing, the guard can already contain the Some written by enqueue() (so is_none() is false), which causes send() to skip the while loop, never call condvar.wait(guard), and finally drop everything, including error_code and the condvar: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1220-L1223

which causes this notify_all() in the httpd thread to fail: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1250-L1250

Usually Rust protects against such issues, but here unsafe is used with a raw pointer to the data structure, so the borrow checker cannot track this data; it is unsafe, just as Rust had to be told: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1212-L1216

More protection is needed so that the send_request data structure is not dropped too soon.
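
In simplified form, the racy interleaving looks roughly like this (a sketch with std types standing in for the embedded-svc mutex and condvar, mirroring the flow in server.rs#L1196-L1251; queue_work() is a hypothetical stand-in for httpd_queue_work(), and the struct and field names are mine, not necessarily the real ones):

use std::sync::{Condvar, Mutex};

struct SendRequest {
    error_code: Mutex<Option<i32>>,
    condvar: Condvar,
}

// send(), running on the caller's thread:
fn send_side() {
    let request = SendRequest {
        error_code: Mutex::new(None),
        condvar: Condvar::new(),
    };

    // a raw pointer to `request` is handed over to the httpd thread
    queue_work(&request as *const SendRequest);

    let mut guard = request.error_code.lock().unwrap();
    while guard.is_none() {
        // (3) if enqueue() has already stored Some by the time of the first check,
        //     this loop body never runs and we never wait on the condvar
        guard = request.condvar.wait(guard).unwrap();
    }
    drop(guard);
    // (4) `request` (mutex + condvar) is dropped at the end of this function,
    //     possibly while the httpd thread is still between (1) and (2) below
}

// enqueue(), running on the httpd thread, which only has the raw pointer:
unsafe fn enqueue_side(request_ptr: *const SendRequest) {
    let request = &*request_ptr;
    *request.error_code.lock().unwrap() = Some(0); // (1) guard dropped immediately
    request.condvar.notify_all();                  // (2) may run on freed memory
}

// hypothetical stand-in for httpd_queue_work(): wakes the httpd thread
fn queue_work(_request: *const SendRequest) {}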

mike1703 commented 10 months ago

I did some additional debugging and included a check to see if the while loop here was executed at least once: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1220-L1225

let mut check = false;
[...]
 while guard.is_none() { 
     guard = send_request.condvar.wait(guard); 
+    check = true;
 } 

and checked whether this is false (which should never be the case!):

https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1226-L1232

         return Err(EspError::from_infallible::<ESP_FAIL>()); 
     } 

+    if !check {
+        info!("check = {check}");
+    }
     Ok(()) 
 } 

and I get a lot of check = false output. Here is some actual output from my test with this code:

I (31500) esp_idf_svc::private::mutex: notify_all()
I (31500) esp_idf_svc::http::server::ws: check = false
I (31510) esp_idf_svc::private::mutex: notify_all()
I (31520) esp_idf_svc::private::mutex: notify_all()
I (31520) esp_idf_svc::private::mutex: notify_all()
I (31530) esp_idf_svc::private::mutex: notify_all()
I (31530) esp_idf_svc::private::mutex: notify_all()
I (31540) esp_idf_svc::private::mutex: notify_all()
I (31550) esp_idf_svc::private::mutex: notify_all()
I (31550) esp_idf_svc::http::server::ws: check = false
I (31560) esp_idf_svc::private::mutex: notify_all()
I (31560) esp_idf_svc::private::mutex: notify_all()

Now it does not crash anymore (at least during my fairly short tests), because the request stays valid for the extra time send() now spends on the info logging, giving the httpd thread more time to execute the code at https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1250. So my assumption was correct: the guard.is_none() check sometimes fails right on the first attempt, i.e. the value is already Some before the loop is ever entered.

ivmarkov commented 10 months ago

@mike1703 Thanks for your detailed analysis! I think the fix is as simple as replacing

*request.error_code.lock() = Some(ret as _);
request.condvar.notify_all();

with

let mut guard = request.error_code.lock();

*guard = Some(ret as _);

request.condvar.notify_all();

Can you try it out? I'll commit it shortly as well, as I'm relatively sure it would address the problem.

mike1703 commented 10 months ago

Thank you for this solution!

Your change prolongs the lifetime of the lock guard so that it lives past the notify_all() call before being dropped. I had not noticed before that the guard was dropped immediately after setting the value. Because of that, notify_all() could sometimes operate on an already destroyed condvar, since the main thread was able to grab the lock too soon.
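
In generic terms (std types and an explicit drop just for illustration, not the actual server code), the difference is only where the guard ends:

// before: the guard returned by lock() is dropped right after the assignment,
// so send() can observe Some, return, and drop the whole request before
// notify_all() runs on the (by then freed) condvar
*request.error_code.lock().unwrap() = Some(ret);
request.condvar.notify_all();

// after: the guard is held across notify_all(), so send() cannot re-acquire
// the mutex (and therefore cannot drop the request) until the notification
// has been issued
let mut guard = request.error_code.lock().unwrap();
*guard = Some(ret);
request.condvar.notify_all();
drop(guard);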

I now also see that I was a bit confused in my previous statements: check == false does not mean anything other than that the httpd thread had written the return value before the first iteration of the while loop, which is completely legitimate...

A first test with your proposed change does not crash even after a few minutes (without the change I can trigger a crash within 20-30 seconds).

Thank you!

ivmarkov commented 10 months ago

Let's wait for a couple of hours or days if necessary, and once you feel comfortable that it works now, you can close this.

mike1703 commented 10 months ago

This fixed my issue. My device has not crashed since yesterday.