It appears the loop at:
#3 0x005fc584 in crossbeam_epoch::sync::list::List<T,C>::insert (self=0xf7100f40, container=..., guard=0x643530 <crossbeam_epoch::guard::unprotected::UNPROTECTED>)
at build/reproducer/vendor/crossbeam-epoch/src/sync/list.rs:186
never breaks
https://github.com/crossbeam-rs/crossbeam/blob/master/crossbeam-epoch/src/sync/list.rs#L182-L192
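For illustration, here's a simplified sketch (not the actual crossbeam-epoch code) of the retry pattern that loop uses: the insertion only completes once `compare_exchange_weak` succeeds, so a hardware/codegen combination where the weak CAS always fails spuriously spins here forever.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Push a "node" (represented here by a plain usize) onto a lock-free head.
// List::insert in crossbeam-epoch follows the same shape, with tagged
// pointers and epoch guards instead of plain integers.
fn push(head: &AtomicUsize, new: usize) {
    let mut current = head.load(Ordering::Relaxed);
    loop {
        // A weak CAS is allowed to fail spuriously; the loop assumes it will
        // eventually succeed. If it never does, this spins forever.
        match head.compare_exchange_weak(current, new, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => break,
            Err(actual) => current = actual,
        }
    }
}

fn main() {
    let head = AtomicUsize::new(0);
    push(&head, 1);
    assert_eq!(head.load(Ordering::Relaxed), 1);
}
```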
#0 0xf7856778 in pthread_cond_wait@@GLIBC_2.4 () from /lib/libpthread.so.0
#1 0x0054b6b8 in std::sys::unix::condvar::Condvar::wait (self=<optimized out>, mutex=0x80) at library/std/src/sys/unix/condvar.rs:69
#2 std::sys_common::condvar::Condvar::wait (self=<optimized out>, mutex=<optimized out>) at library/std/src/sys_common/condvar.rs:50
#3 std::sync::condvar::Condvar::wait (self=0x21b2e88, guard=...) at library/std/src/sync/condvar.rs:196
#4 rayon_core::sleep::Sleep::sleep (self=0x21b33d4, idle_state=0xf78306f0, latch=<optimized out>, has_injected_jobs=...) at build/reproducer/vendor/rayon-core/src/sleep/mod.rs:226
#5 0x0055085c in rayon_core::sleep::Sleep::no_work_found (self=<optimized out>, idle_state=0xf78306f0, latch=0x21b2cb8, has_injected_jobs=...)
at build/reproducer/vendor/rayon-core/src/sleep/mod.rs:120
#6 rayon_core::registry::WorkerThread::wait_until_cold (self=<optimized out>, latch=<optimized out>) at build/reproducer/vendor/rayon-core/src/registry.rs:733
#7 0x0054f84c in rayon_core::registry::WorkerThread::wait_until (self=0xf78308c0, latch=<optimized out>) at build/reproducer/vendor/rayon-core/src/registry.rs:704
#8 rayon_core::registry::main_loop (registry=..., index=0, worker=...) at build/reproducer/vendor/rayon-core/src/registry.rs:837
#9 rayon_core::registry::ThreadBuilder::run (self=...) at build/reproducer/vendor/rayon-core/src/registry.rs:56
#10 0x00552ef8 in <rayon_core::registry::DefaultSpawn as rayon_core::registry::ThreadSpawn>::spawn::{{closure}} () at build/reproducer/vendor/rayon-core/src/registry.rs:101
#11 std::sys_common::backtrace::__rust_begin_short_backtrace (f=...) at library/std/src/sys_common/backtrace.rs:137
#12 0x005490d4 in std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} () at library/std/src/thread/mod.rs:464
#13 <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (self=..., _args=<optimized out>) at library/std/src/panic.rs:308
#14 std::panicking::try::do_call (data=<optimized out>) at library/std/src/panicking.rs:381
#15 std::panicking::try (f=...) at library/std/src/panicking.rs:345
#16 std::panic::catch_unwind (f=...) at library/std/src/panic.rs:382
#17 std::thread::Builder::spawn_unchecked::{{closure}} () at library/std/src/thread/mod.rs:463
#18 core::ops::function::FnOnce::call_once{{vtable-shim}} () at library/core/src/ops/function.rs:227
#19 0x00590794 in std::sys::unix::thread::Thread::new::thread_start ()
I suppose it's rayon-core now? Especially since it's sufficient to optimize only rayon-core to trigger the issue.
The loop that goes on forever is now:
#6 rayon_core::registry::WorkerThread::wait_until_cold (self=<optimized out>, latch=<optimized out>) at build/reproducer/vendor/rayon-core/src/registry.rs:733
https://github.com/rayon-rs/rayon/blob/master/rayon-core/src/registry.rs#L714-L733
And just to be clear: all of this works fine on the same machine with a 64-bit userspace. It's only this specific combination of 64-bit hardware/kernel + 32-bit userspace that triggers it.
Thank you for the detailed report!
I think the root cause is a buggy hardware implementation: a weak CAS at least "tries" to succeed, and always failing the operation is a no-go...
Still, I think Crossbeam (and Rayon?) should be patched to support such hardware as well. I'd like to use this issue as an opportunity to review all occurrences of weak CAS.
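For context, a minimal sketch of the distinction (an illustration, not a proposed patch): `compare_exchange` is not allowed to fail spuriously, so a retry loop whose termination depends on the CAS eventually succeeding can use the strong variant, while `compare_exchange_weak` trades that guarantee for cheaper code on LL/SC architectures.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

fn increment(counter: &AtomicU32) {
    let mut cur = counter.load(Ordering::Relaxed);
    // Strong CAS: a failure means another thread really changed the value,
    // so some thread always makes progress. With compare_exchange_weak the
    // same loop additionally relies on the hardware eventually letting the
    // underlying LL/SC pair succeed.
    while let Err(actual) =
        counter.compare_exchange(cur, cur + 1, Ordering::AcqRel, Ordering::Relaxed)
    {
        cur = actual;
    }
}

fn main() {
    let c = AtomicU32::new(0);
    increment(&c);
    assert_eq!(c.load(Ordering::Relaxed), 1);
}
```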
Just as a summary, here's an overview of the conditions under which weak CAS works and doesn't work:
Hardware | Userspace | crossbeam-epoch opt-level | Result |
---|---|---|---|
32-bit | * | * | OK |
64-bit | 64-bit | * | OK |
64-bit | 32-bit | opt-level=0 | OK |
64-bit | 32-bit | opt-level>=1 | FAIL |
Maybe related: https://github.com/rust-lang/rust/issues/60605
@taiki-e great finding! After switching to abi.cp15_barrier=2 everything works fine without any code changes. Pity it still hasn't been picked up by the LLVM team after so long.
Closing in favor of the upstream issue.
As requested in rayon-rs/rayon#820, forwarding the issue to crossbeam as well:
That's related to rust-lang/rust#53670.
When running under certain specific conditions, the following code will hang, eating 2 cores:
Conditions:
Regarding optimization level, it appears that it's sufficient to optimize only crossbeam-epoch to trigger the issue.
gdb backtrace from one of the active threads:
rayon version: 1.5.0
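The original reproducer code isn't quoted above; a hypothetical minimal example exercising the global rayon thread pool (which registers its worker threads with crossbeam-epoch on startup) would look roughly like this, assuming rayon 1.5:

```rust
// Hypothetical minimal reproducer (the original code from the report is not
// shown here). Any use of a parallel iterator starts the global rayon thread
// pool; each worker thread registers with crossbeam-epoch, which is where the
// weak-CAS insertion loop hangs on the affected configuration.
use rayon::prelude::*;

fn main() {
    let sum: u64 = (0u64..1_000_000).into_par_iter().sum();
    println!("sum = {}", sum);
}
```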