workingjubilee opened this issue 3 months ago
Hmm yeah so this happens inside the Drop for a CudaSlice. The call to free async is returning an error. Notably the error is actually from a previous call (which is totally normal in cuda land). So the error does not necessarily have to do with the free.
However, I'm not sure what the expectation is for the Rust side in this case?
Since Drop can't return a Result, even if we determine that the error encountered here is not caused by the call to free, the system is still in an error state, and I think we'd be in undefined behavior territory.
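To make the dilemma concrete, here is a rough sketch (hypothetical names, not cudarc's actual internals) of the shape of the problem: Drop has no way to surface a Result, so an error from the async free, which may really belong to an earlier launch, can only be ignored, logged, or turned into a panic.

struct DeviceBuffer;

impl DeviceBuffer {
    // Stand-in for the driver-level async free; CUDA can report an error
    // here that was actually produced by a previous, unrelated call.
    fn free_async(&mut self) -> Result<(), String> {
        Err("CUDA_ERROR_ILLEGAL_ADDRESS".into())
    }
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        // Drop cannot return the error, so the choices are: ignore it,
        // log it, or panic.
        if let Err(e) = self.free_async() {
            panic!("free_async failed: {e}");
        }
    }
}

fn main() {
    let _buf = DeviceBuffer;
    // _buf is dropped at the end of main; in this sketch the drop panics.
}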
So I guess my first reaction is this is working as intended?
Does anyone have any better outcome to this situation?
What do you mean by "I think we'd be in undefined behavior territory"?
Well, if one of the kernels fails to launch, continuing on could result in undefined behavior. For example, maybe the kernel that failed was supposed to initialize some memory, and if we continue on, then subsequent kernels may be working with memory that was never initialized.
Also, I'm not sure whether the call to free actually succeeds when it returns an error from a previous call, so the memory is in an unclear state. The NVIDIA docs don't specify, so to me that is undefined behavior (it could go either way).
Hmm. So, when I discuss the soundness of Rust programs and I say "undefined behavior", I do not mean "it may be this or that". I mean that a generous compiler would abort the program if it recognized the case had occurred, and that most compilers will simply assume it cannot happen and do something that should never happen, like rearranging the code so that a path that should never be called is reached.
It is most useful to distinguish this from a nondeterministic result, by which I mean the classic "one of two outcomes occurred, and it's not clear which".
Part of why it is very useful to make this distinction is that acting on the assumption that one of two possible results happened can sometimes lead to full-stop, no-holds-barred undefined behavior. So nondeterminism is not UB yet, but it is a common source of UB in a program (e.g. if we free a second time "just to make sure", we may double-free).
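A minimal sketch of that last point, with host memory standing in for device memory: if we don't know whether the first free succeeded and we free again "just to make sure", we get a genuine double free.

use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // host allocation standing in for a device allocation
    let layout = Layout::new::<u64>();
    unsafe {
        let p = alloc(layout);
        if p.is_null() {
            return;
        }
        dealloc(p, layout);
        // dealloc(p, layout); // freeing again "just to make sure" would be a double free: UB
    }
}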
And as far as uninitialized memory goes, a Rust program is allowed to manipulate uninitialized memory. It is not even unsafe:
fn main() {
    let byte = std::mem::MaybeUninit::<u8>::uninit();
    // moving the uninitialized value around is fine
    let mut byte = byte;
    // and writing to it is too; only assuming it is initialized
    // (assume_init) before writing would be a problem
    byte.write(5);
}
Deallocated memory, or never-allocated memory, however, should not be examined.
Anyways, in another example I had encountered, the error was, in fact, CUDA_ERROR_ILLEGAL_ADDRESS. I am slightly concerned that what may wind up happening sometimes is that a panic occurs for other reasons, the panic unwinds the stack and starts executing destructors, and because we're in a "doomed universe"... as you said, a prior error can be returned... we essentially automatically panic-in-panic. This makes it harder to debug an already-bad situation.
Fair points, that all tracks with me.
I guess I'm still not sure if there's a better way to handle this inside cudarc. One CUDA-specific thing that would help debugging in this situation is to set export CUDA_LAUNCH_BLOCKING=1, which makes all the launch calls blocking. That would take care of the fact that this error is from a previous launch.
However in the case where the call to free actually fails with its own error, I'm not sure there's much we can do other than panic.
Thanks for the tip! I'll try to use that for debugging the issues I'm seeing a bit more intently.
One option is that you could check whether Rust is currently panicking and use that to decide whether to raise another panic, or even whether to call cudaFreeAsync at that point at all? Definitely a panic should happen, but it's not clear to me that panicking and then panicking again is useful.
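A minimal sketch of that suggestion (hypothetical type, not cudarc's actual code): std::thread::panicking() tells a Drop impl whether the thread is already unwinding, so it can report the error instead of raising a second panic, which would abort the process.

struct DeviceBuffer;

impl DeviceBuffer {
    // Stand-in for the async free that may report an error.
    fn free_async(&mut self) -> Result<(), String> {
        Err("CUDA error reported at free time".into())
    }
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        if let Err(e) = self.free_async() {
            if std::thread::panicking() {
                // Already unwinding from an earlier panic: a second panic
                // would abort the process, so just report the error.
                eprintln!("free_async failed during unwinding: {e}");
            } else {
                panic!("free_async failed: {e}");
            }
        }
    }
}

fn main() {
    let _buf = DeviceBuffer;
    // Some unrelated failure panics first; _buf's destructor then runs
    // during unwinding and takes the eprintln! branch instead of
    // panicking again.
    panic!("kernel launch failed");
}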
Observe the following backtrace, from the tail end of a compute-sanitizer report. If a panic occurs when Rust is inside what is supposed to be a "nounwind" function, then Rust will adopt a more cynical attitude and start doing what it takes to contain the clearly-unsalvageable runtime.