mike1703 closed this issue 10 months ago.
I changed my release build with:
[profile.release]
opt-level = 1
debug-assertions = true
I managed to get another crash with the same behavior but a different trace.
#0 0x400831f9 in panic_abort (details=0x3ffe208b "assert failed: xQueueSemaphoreTake queue.c:1657 (pxQueue->uxItemSize == 0)") at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_system/panic.c:452
No locals.
#1 0x4008a040 in esp_system_abort (details=0x3ffe208b "assert failed: xQueueSemaphoreTake queue.c:1657 (pxQueue->uxItemSize == 0)") at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_system/port/esp_system_chip.c:84
No locals.
#2 0x400920dc in __assert_func (file=0x3f41cc06 "queue.c", line=<optimized out>, func=<optimized out>, expr=0x3f41cfa0 "pxQueue->uxItemSize == 0") at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/assert.c:81
addr = "\000\000\000\000\000\000\000\000\000\000"
buff = "assert failed: xQueueSemaphoreTake queue.c:1657 (pxQueue->uxItemSize == 0)\000\000\000\000\000\000\000\f\000\000\000#\v\006\000 !\376?@\276\027@p!\376?#\v\006\000 !\376?@\276\027@p!\376?#\v\006\000|\264\374?\001\000\000\000\377\377?\263 ?\b@\360\277\000@0\005\006\000\205\303\b\200\320!\376?\000\000\000\000#\005\006\000 \005\006\000#\005\006\000\377\377?\263\377\377?\263\001\000\000\000\004\000\000\000#\005\006", '\000' <repeats 11 times>
lbuf = "1657"
rem_len = 125
off = <optimized out>
str = {0x3ffb5150 "assert failed: ", 0x3f41d13c <__func__$9> "xQueueSemaphoreTake", 0x3ffb59ac " ", 0x3f41cc06 "queue.c", 0x3ffb5160 ":", 0x3ffe2153 "1657", 0x3ffb5164 " (", 0x3f41cfa0 "pxQueue->uxItemSize == 0", 0x3ffb5038 ")"}
#3 0x4008aa09 in xQueueSemaphoreTake (xQueue=0x3ffdcd38, xTicksToWait=<optimized out>) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/queue.c:1657
xEntryTimeSet = 0
xTimeOut = {xOverflowCount = 0, xTimeOnEntering = 0}
pxQueue = 0x3ffdcd38
xInheritanceOccurred = 0
__func__ = "xQueueSemaphoreTake"
#4 0x4008ab27 in xQueueTakeMutexRecursive (xMutex=0x3ffdcd38, xTicksToWait=4294967295) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/queue.c:783
xReturn = <optimized out>
pxMutex = 0x3ffdcd38
__func__ = "xQueueTakeMutexRecursive"
#5 0x400843e8 in lock_acquire_generic (lock=0x3ffdc6fc, delay=4294967295, mutex_type=4 '\004') at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:144
h = 0x3ffdcd38
__func__ = "lock_acquire_generic"
success = <optimized out>
#6 0x400844f0 in _lock_acquire_recursive (lock=0x3ffdc6fc) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:158
No locals.
#7 0x40159d14 in pthread_cond_broadcast (cv=0x3ffd33b4) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/pthread/pthread_cond_var.c:82
res = 0
cond = 0x3ffdc6fc
entry = <optimized out>
#8 0x4013bec0 in esp_idf_svc::private::mutex::RawCondvar::notify_all (self=0x0) at src/private/mutex.rs:107
No locals.
#9 esp_idf_svc::private::mutex::{impl#10}::notify_all (self=0x0) at src/private/mutex.rs:148
No locals.
#10 embedded_svc::utils::mutex::Condvar<esp_idf_svc::private::mutex::RawCondvar>::notify_all<esp_idf_svc::private::mutex::RawCondvar> (self=0x0) at /home/esp/.cargo/registry/src/index.crates.io-6f17d22bba15001f/embedded-svc-0.26.4/src/utils/mutex.rs:176
No locals.
#11 esp_idf_svc::http::server::ws::EspHttpWsDetachedSender::enqueue (arg=0x3ffd3398) at src/http/server.rs:1250
ret = 0
request = 0x3ffd3398
#12 0x40183884 in httpd_process_ctrl_msg (hd=<optimized out>) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:224
msg = {hc_msg = HTTPD_CTRL_WORK, hc_work = 0x4013be74 <esp_idf_svc::http::server::ws::EspHttpWsDetachedSender::enqueue>, hc_work_arg = 0x3ffd3398}
ret = <optimized out>
__func__ = "httpd_process_ctrl_msg"
#13 0x40184055 in httpd_server (hd=0x3ffde354) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:292
read_set = {__fds_bits = {0, 16777216}}
tmp_max_fd = 55
maxfd = <optimized out>
__func__ = "httpd_server"
active_cnt = <optimized out>
context = {fdset = 0x3ffe22e0, hd = 0x3ffde354}
#14 0x40184135 in httpd_thread (arg=0x3ffde354) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/esp_http_server/src/httpd_main.c:327
ret = <optimized out>
hd = 0x3ffde354
__func__ = "httpd_thread"
#15 0x4008cf98 in vPortTaskWrapper (pxCode=0x4018410c <httpd_thread>, pvParameters=0x3ffde354) at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
pcTaskName = <optimized out>
I created a minimal example that does not require changing too much of the original esp-idf-svc code.
The minimal example can be found here: https://github.com/mike1703/rust-esp-debug/blob/main/src/main.rs To make the error reproducible much more quickly than with the original esp-idf-svc, I applied this small change, which triggers the same error far sooner: https://github.com/esp-rs/esp-idf-svc/compare/master...mike1703:esp-idf-svc:debug
pub fn notify_all(&self) {
+   info!("notify_all()");
    let r = unsafe { pthread_cond_broadcast(self.0.get()) };
    debug_assert_eq!(r, 0);
}
This adds logging when the Condvar::notify_all() call is made. It probably just increases the time spent before actually entering the notify_all() function (and dereferencing the raw pointer), and in that window the main thread probably somehow deinitializes the pointer contained in the Condvar.
This is the part of the code I think is relevant, where the race condition seems to happen: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1196-L1251
I still don't see exactly how this is failing, but I hope this example is enough to give someone else a chance to see the error.
With the logging active, the error can be reproduced very easily in ~30-60 seconds on my ESP32 after connecting with $ /usr/bin/python -m websockets ws://<esp_hostname>/data.
Without the added logging it can take hours or a day to be "lucky" and reproduce this issue. My actual appliance runs into it 2-4 times a day, sometimes more often, sometimes less.
Hmm, it could be very unlucky timing in this code section, as follows.
This initializes the error_code (a Mutex) with None; it is kept in this function and also passed into the httpd thread:
https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1208
This sets the error_code to Some(ret) in enqueue() to signal that the function is ready and will call notify_all() next:
https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1249
And here, very unlucky timing can cause the guard to already contain the Some (is_none() != true) written by the enqueue function, so the main thread never enters the while() loop, never calls condvar.wait(guard), and finally drops everything, including error_code and condvar:
https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1220-L1223
causing this notify_all() in the httpd thread to fail: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1250-L1250
Usually Rust would protect against such issues, but here unsafe is used with a raw pointer to the data structure, so there is no borrow checking for this data; it is unsafe, just as Rust had to be told:
https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1212-L1216
More protection is needed so that the send_request data structure is not dropped too soon.
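To make the suspected ordering concrete, here is a minimal sketch of the two sides using plain std primitives. This is my own approximation, not the actual esp-idf-svc code; SendRequest and its fields are stand-ins for the structure that send() keeps on its stack and hands to the httpd thread as a raw pointer (via the httpd work-queue mechanism visible in the trace above).

use std::sync::{Condvar, Mutex};
use std::thread;

// Stand-in for the structure that send() keeps on its stack; in the real code
// a raw pointer to it is handed to the httpd thread, so the borrow checker
// cannot protect it.
struct SendRequest {
    error_code: Mutex<Option<i32>>,
    condvar: Condvar,
}

// httpd-thread side (roughly what enqueue() does).
fn enqueue(request: &SendRequest) {
    *request.error_code.lock().unwrap() = Some(0); // temporary guard dropped here
    // <- window: if the waiter re-checks now, it sees Some(_), returns from
    //    send() and drops SendRequest; with a raw pointer the notify_all()
    //    below would then operate on an already destroyed condvar.
    request.condvar.notify_all();
}

// caller side (roughly what send() does).
fn send(request: &SendRequest) -> i32 {
    let mut guard = request.error_code.lock().unwrap();
    while guard.is_none() {
        guard = request.condvar.wait(guard).unwrap();
    }
    guard.unwrap() // send() returns; SendRequest is dropped right after
}

fn main() {
    let request = SendRequest {
        error_code: Mutex::new(None),
        condvar: Condvar::new(),
    };
    // Scoped threads keep this std version memory-safe; the point is only
    // to show the ordering of the two sides.
    thread::scope(|s| {
        s.spawn(|| enqueue(&request));
        println!("ret = {}", send(&request));
    });
}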
I did some additional debugging and included a check to see if the while loop here was executed at least once: https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1220-L1225
let mut check = false;
[...]
while guard.is_none() {
    guard = send_request.condvar.wait(guard);
+   check = true;
}
and added a check for whether it is still false afterwards (which should never be the case!):
    return Err(EspError::from_infallible::<ESP_FAIL>());
}
+ if !check {
+     info!("check = {check}");
+ }
Ok(())
}
and I get a lot of check = false output. Here is some actual output from my test with this code:
I (31500) esp_idf_svc::private::mutex: notify_all()
I (31500) esp_idf_svc::http::server::ws: check = false
I (31510) esp_idf_svc::private::mutex: notify_all()
I (31520) esp_idf_svc::private::mutex: notify_all()
I (31520) esp_idf_svc::private::mutex: notify_all()
I (31530) esp_idf_svc::private::mutex: notify_all()
I (31530) esp_idf_svc::private::mutex: notify_all()
I (31540) esp_idf_svc::private::mutex: notify_all()
I (31550) esp_idf_svc::private::mutex: notify_all()
I (31550) esp_idf_svc::http::server::ws: check = false
I (31560) esp_idf_svc::private::mutex: notify_all()
I (31560) esp_idf_svc::private::mutex: notify_all()
Now it does not crash anymore (at least during my pretty short tests), as the httpd thread gains more time to execute the code in https://github.com/esp-rs/esp-idf-svc/blob/6a68206e502238ed30830f88b1f253922dc35bb0/src/http/server.rs#L1250; the structure now stays valid for the longer time it takes to do all the info logging.
So my assumption was correct that the guard.is_none() check can already be false on the first attempt, i.e. the guard already contains Some.
@mike1703 Thanks for your detailed analysis! I think the fix is as simple as replacing
*request.error_code.lock() = Some(ret as _);
request.condvar.notify_all();
with
let mut guard = request.error_code.lock();
*guard = Some(ret as _);
request.condvar.notify_all();
Can you try it out? I'll commit it shortly as well, as I'm relatively sure it would address the problem.
Thank you for this solution!
Your change prolongs the lifetime of the lock guard, so it lives beyond the notify_all() call before being dropped.
I had not seen before that the guard was dropped immediately after setting the value. Because of that, the main thread could grab the lock too soon, and notify_all() sometimes operated on an already destroyed condvar due to this race condition.
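In terms of the std sketch above, the fix corresponds to replacing the enqueue() side with a version that holds the lock guard across notify_all(); the waiter then cannot observe Some(_) and drop SendRequest until the notification has completed. This is only an illustration of the pattern, not the actual esp-idf-svc code.

// Fixed notifier side for the sketch above (reuses its SendRequest type):
// the lock guard is held across notify_all(), so the waiting thread cannot
// re-acquire the mutex, observe Some(_) and drop SendRequest before the
// notification has completed.
fn enqueue_fixed(request: &SendRequest) {
    let mut guard = request.error_code.lock().unwrap();
    *guard = Some(0);
    request.condvar.notify_all();
    // `guard` is dropped here, only after notify_all() has returned.
}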
I now also see that I was a little confused in my previous statements: check == false means nothing more than that the httpd thread had already written the return value before the first iteration of the while loop, which is completely legitimate...
A first test with your proposed change does not crash even after a few minutes (without it I can reproduce the crash in 20-30 seconds).
Thank you!
Let's wait for a couple of hours or days if necessary, and once you feel comfortable that it works now, you can close this.
This fixed my issue. My device has not crashed since yesterday.
I have an HTTP server running on my ESP32 with a websocket endpoint created with ws_handler(), which stores the result of ws_connection.create_detached_sender() in my app state to be able to regularly send some data with it. Sometimes it fails with a crash in the code at /builds/rust-project/.embuild/espressif/esp-idf/v5.1.2/components/newlib/locks.c:111.
What I noticed in the resulting coredump is that the CondVar always seems to be 0x0 once it crashes, leading to pthread_cond_broadcast() being fed a pointer to 0x0 and ending with lock_acquire_generic() dereferencing 0x0 (and a LoadProhibited exception). I am still trying to create a minimal example showing this behavior but have not yet been able to reproduce the error; in my real project it often happens only after a few hours or a day. I have two threads reading and writing from two UARTs and several other threads doing other things, all providing data to a central component which sends data via this detached sender. I am not sure how these other threads could interfere.
The EspHttpWsDetachedSender::send() function seems to create and drop the CondVar, and EspHttpWsDetachedSender::enqueue() uses it in the httpd thread in between, so it should not be invalid. I don't understand why there can be a dereference of 0x0, but to me it looks like there is a bug lurking somewhere in esp-idf-svc.
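Roughly, my setup looks like the following sketch. It is heavily simplified; the handler details, the app-state layout and the periodic loop here are illustrative assumptions rather than my real code, which is driven by the UART threads.

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

use embedded_svc::ws::FrameType;
use esp_idf_svc::http::server::ws::EspHttpWsDetachedSender;
use esp_idf_svc::http::server::{Configuration, EspHttpServer};
use esp_idf_svc::sys::EspError;

fn main() -> Result<(), EspError> {
    esp_idf_svc::sys::link_patches();

    // App state: the detached sender created by the websocket handler.
    let sender: Arc<Mutex<Option<EspHttpWsDetachedSender>>> = Arc::new(Mutex::new(None));

    let mut server = EspHttpServer::new(&Configuration::default())?;

    // Websocket endpoint /data: store a detached sender in the app state
    // when a client connects.
    let handler_state = sender.clone();
    server.ws_handler("/data", move |ws| -> Result<(), EspError> {
        if ws.is_new() {
            *handler_state.lock().unwrap() = Some(ws.create_detached_sender()?);
        }
        Ok(())
    })?;

    // Regularly send data through the stored detached sender (in my real
    // project this is triggered by data coming from the other threads).
    loop {
        if let Some(s) = sender.lock().unwrap().as_mut() {
            let _ = s.send(FrameType::Text(false), b"some data");
        }
        thread::sleep(Duration::from_millis(500));
    }
}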