coreylowman / cudarc

Safe rust wrapper around CUDA toolkit
Apache License 2.0
624 stars, 77 forks

Current destructor strategy yields panic-in-panic #277

Open workingjubilee opened 3 months ago

workingjubilee commented 3 months ago

Observe the following backtrace, from the tail end of a compute-sanitizer report. If a panic occurs when Rust is inside what is supposed to be a "nounwind" function, then Rust will adopt a more cynical attitude and start doing what it takes to contain the clearly-unsalvageable runtime.

thread 'main' panicked at ${HOME}/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/src/driver/safe/core.rs:252:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_LAUNCH_FAILED, "unspecified launch failure")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at ${HOME}/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/src/driver/safe/core.rs:252:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_LAUNCH_FAILED, "unspecified launch failure")
stack backtrace:
   0:     0x55b30644ed72 - std::backtrace_rs::backtrace::libunwind::trace::he4ee80166a02c846
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:     0x55b30644ed72 - std::backtrace_rs::backtrace::trace_unsynchronized::h476faccf57e88641
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55b30644ed72 - std::sys_common::backtrace::_print_fmt::h430c922a77e7a59c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x55b30644ed72 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hffecb437d922f988
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55b30647be4c - core::fmt::rt::Argument::fmt::hf3df69369399bfa9
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/rt.rs:142:9
   5:     0x55b30647be4c - core::fmt::write::hd9a8d7d029f9ea1a
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/fmt/mod.rs:1153:17
   6:     0x55b30644b6bf - std::io::Write::write_fmt::h0e1226b2b8d973fe
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/io/mod.rs:1843:15
   7:     0x55b30644eb44 - std::sys_common::backtrace::_print::hd2df4a083f6e69b8
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x55b30644eb44 - std::sys_common::backtrace::print::he907f6ad7eee41cb
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x55b3064502fb - std::panicking::default_hook::{{closure}}::h3926193b61c9ca9b
  10:     0x55b306450053 - std::panicking::default_hook::h25ba2457dea68e65
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:292:9
  11:     0x55b30645079d - std::panicking::rust_panic_with_hook::h0ad14d90dcf5224f
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:779:13
  12:     0x55b306450672 - std::panicking::begin_panic_handler::{{closure}}::h4a1838a06f542647
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:657:13
  13:     0x55b30644f246 - std::sys_common::backtrace::__rust_end_short_backtrace::h77cc4dc3567ca904
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x55b3064503a4 - rust_begin_unwind
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
  15:     0x55b304a8af95 - core::panicking::panic_fmt::h940d4fd01a4b4fd1
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
  16:     0x55b304a8b583 - core::result::unwrap_failed::h5119205a73b72b0d
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1654:5
  17:     0x55b3055841fa - core::result::Result<T,E>::unwrap::h4351702c55915e75
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1077:23
  18:     0x55b3055841fa - <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop::h199522ab8947bf06
                               at ${HOME}/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/src/driver/safe/core.rs:252:17
  19:     0x55b30557f437 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<f32>>::hc474f8aea52e9fe5
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  20:     0x55b30557eece - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorageSlice>::h02edab09154be304
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  21:     0x55b30557ead7 - core::ptr::drop_in_place<candle_core::cuda_backend::CudaStorage>::h4e03b09cdf55c603
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  22:     0x55b30557debc - core::ptr::drop_in_place<candle_core::storage::Storage>::h7be6009f4c2999a5
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  23:     0x55b30557fddb - core::ptr::drop_in_place<core::cell::UnsafeCell<candle_core::storage::Storage>>::hf62ca5d0a694c5b9
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  24:     0x55b30558012f - core::ptr::drop_in_place<std::sync::rwlock::RwLock<candle_core::storage::Storage>>::h9c84a0dc451d5c0b
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  25:     0x55b3054f4f9f - alloc::sync::Arc<T,A>::drop_slow::h963cb23aaa1b6829
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/sync.rs:1804:18
  26:     0x55b305581902 - <alloc::sync::Arc<T,A> as core::ops::drop::Drop>::drop::h0bfec242c64e8bd3
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/sync.rs:2459:13
  27:     0x55b30557a63b - core::ptr::drop_in_place<alloc::sync::Arc<std::sync::rwlock::RwLock<candle_core::storage::Storage>>>::hb3ee87cb7d8fccdc
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  28:     0x55b30557dd3e - core::ptr::drop_in_place<candle_core::tensor::Tensor_>::ha997d43af8c0f265
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  29:     0x55b3054f4fef - alloc::sync::Arc<T,A>::drop_slow::h9e22af1c62f14f2f
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/sync.rs:1804:18
  30:     0x55b305581a82 - <alloc::sync::Arc<T,A> as core::ops::drop::Drop>::drop::h469fb82faeb75789
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/alloc/src/sync.rs:2459:13
  31:     0x55b30557fa0b - core::ptr::drop_in_place<alloc::sync::Arc<candle_core::tensor::Tensor_>>::h476f8404d54093fe
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  32:     0x55b30557dafb - core::ptr::drop_in_place<candle_core::tensor::Tensor>::h586db7ae68ea4be0
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ptr/mod.rs:515:1
  33:     0x55b30501e6ed - candle_transformers::models::quantized_llama::ModelWeights::forward::h0dfb5d857c758ce3
                               at ${HOME}/${CRATE_USING_CANDLE}/candle-transformers/src/models/quantized_llama.rs:486:9
  34:     0x55b304a9acd2 - <ai_lib::models::llama_3_70b_instruct_32k_gguf::Llama3_70bInstruct32kGGUF as ai_lib::models::model_wrapper::ModelWrapper>::inference::h279f36a3b9cc9fbf
                               at ${HOME}/${CARGO_PROJECT}/ai-lib/src/models/llama_3_70b_instruct_32k_gguf.rs:189:26
  35:     0x55b304a9787a - cli::main::h8ee7a7979e775a0e
                               at ${HOME}/${CARGO_PROJECT}/cli/src/main.rs:103:24
  36:     0x55b304a96fcb - core::ops::function::FnOnce::call_once::h772466b7bf645693
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  37:     0x55b304a9740e - std::sys_common::backtrace::__rust_begin_short_backtrace::h8af7e217acbaa4da
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  38:     0x55b304a96f71 - std::rt::lang_start::{{closure}}::h67443931d40186ff
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:166:18
  39:     0x55b306443a03 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h52f5991f9ab8b369
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:284:13
  40:     0x55b306443a03 - std::panicking::try::do_call::h0ac4bee9a397a1bf
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  41:     0x55b306443a03 - std::panicking::try::hc005decaf198d0ed
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  42:     0x55b306443a03 - std::panic::catch_unwind::hb0f967d870b2a382
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  43:     0x55b306443a03 - std::rt::lang_start_internal::{{closure}}::hd140b84b0efe534b
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:48
  44:     0x55b306443a03 - std::panicking::try::do_call::h1ddfaf1d0d576c38
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  45:     0x55b306443a03 - std::panicking::try::hdd4bdf855547659f
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  46:     0x55b306443a03 - std::panic::catch_unwind::h276ba91c7706110c
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  47:     0x55b306443a03 - std::rt::lang_start_internal::h103c42a9c4e95084
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:20
  48:     0x55b304a96f4a - std::rt::lang_start::he3400f8001dc9f83
                               at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:165:17
  49:     0x55b304a97c7e - main
  50:     0x7f850a229d90 - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  51:     0x7f850a229e40 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:392:3
  52:     0x55b304a96e45 - _start
  53:                0x0 - <unknown>
thread 'main' panicked at library/core/src/panicking.rs:164:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 4835 errors
========= ERROR SUMMARY: 4735 errors were not printed. Use --print-limit option to adjust the number of printed errors
coreylowman commented 3 months ago

Hmm, yeah, so this happens inside the Drop impl for CudaSlice. The async free call is returning an error. Notably, the error actually comes from a previous call (which is totally normal in CUDA land), so it does not necessarily have anything to do with the free itself.
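To make the "error from a previous call" behavior concrete, here is a toy model in plain Rust. It is only a sketch: `FakeStream`, `FakeError`, `launch`, and `free` are hypothetical names invented for illustration, not cudarc or CUDA APIs, and the real driver's rules for when an error surfaces are more involved.

```rust
/// Hypothetical error code, loosely modeled on the one in the log above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FakeError {
    LaunchFailed,
}

/// Toy model of an asynchronous work queue (an assumption, not real CUDA):
/// work is enqueued without being checked, and a failure is reported by
/// whichever call happens to come next.
#[derive(Default)]
struct FakeStream {
    pending: Option<FakeError>,
}

impl FakeStream {
    /// Enqueueing returns Ok immediately, even if the kernel will fault later.
    fn launch(&mut self, kernel_faults: bool) -> Result<(), FakeError> {
        if kernel_faults {
            self.pending = Some(FakeError::LaunchFailed);
        }
        Ok(())
    }

    /// A later, unrelated call reports the pending error.
    fn free(&mut self) -> Result<(), FakeError> {
        match self.pending {
            Some(err) => Err(err),
            None => Ok(()),
        }
    }
}

fn main() {
    let mut stream = FakeStream::default();
    let launch_result = stream.launch(true);
    let free_result = stream.free();
    // The launch looks fine; the free is the call that reports the failure.
    println!("launch: {:?}, free: {:?}", launch_result, free_result);
}
```

In this model the `free` call did nothing wrong, yet it is the call that returns `LaunchFailed`, which mirrors the situation in the backtrace above.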

However, I'm not sure what the expectation is on the Rust side in this case.

Since Drop can't return a result, even if we determine that the error encountered here is not because of the call to free, the system is still in an error state, and I think we'd be in undefined behavior territory.

So I guess my first reaction is this is working as intended?

Does anyone have any better outcome to this situation?

workingjubilee commented 3 months ago

What do you mean by "I think we'd be in undefined behavior territory"?

coreylowman commented 3 months ago

Well, if one of the kernels fails to launch, continuing on could result in undefined behavior. For example, maybe the kernel that failed was supposed to initialize some memory, and if we continue on, then subsequent kernels may be working with memory that was never initialized.

Also, I'm not sure whether the call to free actually succeeds when it returns an error from a previous call, so the memory is in an unclear state. The NVIDIA docs don't specify, so to me that is undefined behavior (it could be either).

workingjubilee commented 3 months ago

Hmm. So, when I discuss the soundness of Rust programs and I say "undefined behavior", I do not mean "it may be this or that". I mean that a generous compiler would abort the program if it recognized that the case occurred, and that most compilers will simply ignore it and do something that should never happen, like arranging the code so that code that is never called is reached.

It is most useful to distinguish this from a nondeterministic result, by which I mean the classic "one of 2 outcomes occurred, and it's not clear which".

Part of why this distinction is so useful is that acting on the assumption that one of two possible results happened can sometimes lead to full-stop, no-holds-barred undefined behavior. So nondeterminism is not UB yet, but it is a common source of UB in a program (e.g. if we free a second time "just to make sure", we may double-free).
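As a sketch of that last point, one way to avoid acting on a guess is to make the release happen at most once, whatever the result says. Everything here (`Handle`, `FakeAllocator`, the addresses) is hypothetical and only illustrates the pattern; it is not how cudarc is implemented.

```rust
/// Hypothetical allocator that records every address passed to `free`.
struct FakeAllocator {
    frees: Vec<u64>,
}

impl FakeAllocator {
    /// Pretend deallocation whose result the caller cannot fully trust.
    fn free(&mut self, raw: u64) -> Result<(), &'static str> {
        self.frees.push(raw);
        Err("error reported; the free may or may not have happened")
    }
}

/// Hypothetical owner of one raw allocation.
struct Handle {
    raw: Option<u64>,
}

impl Handle {
    /// `take()` empties the slot before the call, so a retry can never
    /// pass the same address to `free` a second time.
    fn release(&mut self, alloc: &mut FakeAllocator) -> Result<(), &'static str> {
        match self.raw.take() {
            Some(raw) => alloc.free(raw),
            None => Ok(()),
        }
    }
}

fn main() {
    let mut alloc = FakeAllocator { frees: Vec::new() };
    let mut handle = Handle { raw: Some(0x1000) };
    let first = handle.release(&mut alloc);
    let second = handle.release(&mut alloc); // "just to make sure"
    println!("first: {:?}, second: {:?}, frees: {:?}", first, second, alloc.frees);
}
```

The trade-off is that if the first free did not actually take effect, the memory leaks. A leak is safe in Rust; a double free is not.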

And as far as uninitialized memory goes, a Rust program is allowed to manipulate it. It is not even unsafe:

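fn main() {
    // Creating an uninitialized byte requires no `unsafe`.
    let byte = std::mem::MaybeUninit::<u8>::uninit();
    // Moving it copies (reads) the uninitialized byte; still safe.
    let mut byte = byte;
    // Only now does the byte get a defined value.
    byte.write(5);
}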

Deallocated memory, or never-allocated memory, however, should not be examined.

Anyways, in another example I encountered, the error was, in fact, CUDA_ERROR_ILLEGAL_ADDRESS. I am slightly concerned that what may sometimes wind up happening is that a panic occurs for other reasons, the panic unwinds the stack and starts executing destructors, and because we're in a "doomed universe" (as you said, a prior error can be returned) we essentially automatically panic-in-panic. This makes it harder to debug an already-bad situation.

coreylowman commented 3 months ago

Fair points, that all tracks with me.

I guess I'm still not sure there's a better way to handle this inside cudarc. One CUDA-specific thing that would help with debugging in this situation is to run export CUDA_LAUNCH_BLOCKING=1, which basically makes all the launch calls blocking. An error should then surface at the launch that caused it, rather than at some later call.

However, in the case where the call to free actually fails with its own error, I'm not sure there's much we can do other than panic.

workingjubilee commented 3 months ago

Thanks for the tip! I'll try to use that for debugging the issues I'm seeing a bit more intently.

One option: check whether Rust is already panicking, and use that to decide whether to raise another panic, or even whether to call cudaFreeAsync at all. A panic should definitely happen, but it's not clear to me that panicking and then panicking again is useful.
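A minimal sketch of that idea, using `std::thread::panicking()` from the standard library. The types here (`FakeSlice`, `FakeDriverError`) and the `SUPPRESSED` counter are hypothetical stand-ins, not cudarc's real `CudaSlice` or `DriverError`, and the free is simulated rather than a real driver call.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a driver error; not cudarc's real type.
#[derive(Debug)]
struct FakeDriverError(&'static str);

/// Counts free errors that were reported without panicking (illustration only).
static SUPPRESSED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a device allocation.
struct FakeSlice {
    free_fails: bool,
}

impl FakeSlice {
    /// Simulated free; a real implementation would call into the driver.
    fn free(&mut self) -> Result<(), FakeDriverError> {
        if self.free_fails {
            Err(FakeDriverError("unspecified launch failure"))
        } else {
            Ok(())
        }
    }
}

impl Drop for FakeSlice {
    fn drop(&mut self) {
        if let Err(err) = self.free() {
            if std::thread::panicking() {
                // Already unwinding: a second panic here would abort the
                // process, so report the error and let the first panic win.
                SUPPRESSED.fetch_add(1, Ordering::SeqCst);
                eprintln!("free failed during unwind: {}", err.0);
            } else {
                panic!("failed to free device memory: {}", err.0);
            }
        }
    }
}

/// Panics while a failing slice is alive; returns true if the original
/// panic was caught, i.e. the process did not abort.
fn unwind_with_failing_slice() -> bool {
    let result = panic::catch_unwind(|| {
        let _slice = FakeSlice { free_fails: true };
        panic!("original failure");
    });
    result.is_err()
}

fn main() {
    assert!(unwind_with_failing_slice());
    println!("errors suppressed during unwind: {}", SUPPRESSED.load(Ordering::SeqCst));
}
```

With this shape, the first panic still propagates and its message stays at the top of the output, while a free error outside of unwinding still panics as before.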