NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[FEA]: Enable exceptions by default #2303

Open PointKernel opened 2 months ago

PointKernel commented 2 months ago

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

Originally posted by @kroburg at https://github.com/NVIDIA/cuCollections/issues/589

cuda::stream_ref::wait can throw exceptions

void wait() const
{
  _CCCL_TRY_CUDA_API(::cudaStreamSynchronize, "Failed to synchronize stream.", get());
}
_CCCL_NORETURN inline _LIBCUDACXX_INLINE_VISIBILITY void __throw_cuda_error(::cudaError_t, const char*)
{
  _CUDA_VSTD_NOVERSION::terminate();
}

but with exceptions disabled, the "throw" ends up calling terminate, which exits the process normally with a return value of 0:

_CCCL_NORETURN inline _LIBCUDACXX_INLINE_VISIBILITY void terminate() noexcept
{
  __cccl_terminate();
  _LIBCUDACXX_UNREACHABLE();
}
_CCCL_NORETURN inline _LIBCUDACXX_INLINE_VISIBILITY void __cccl_terminate() noexcept
{
  NV_IF_ELSE_TARGET(NV_IS_HOST, (::std::exit(0);), (__trap();))
  _LIBCUDACXX_UNREACHABLE();
}

CCCL disables exceptions by default, meaning that most APIs that are expected to throw exceptions never throw. One reason for this is that destructors should never throw exceptions (#683). Could we revisit this decision and consider enabling exceptions by default?
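
To make the reported behavior concrete, here is a hedged host-side reproduction (not taken from the issue); it assumes that synchronizing an invalid stream handle makes cudaStreamSynchronize return an error, which then routes through __throw_cuda_error:

// Build with nvcc. With exceptions disabled in libcu++, this program is
// expected to print nothing and exit with status 0 instead of reaching the
// catch block.
#include <cuda/stream_ref>
#include <cuda_runtime_api.h>
#include <cstdio>
#include <exception>

int main()
{
  // Deliberately bogus stream handle so cudaStreamSynchronize fails
  // (assumption: an invalid handle produces an error rather than a crash).
  cudaStream_t bogus = reinterpret_cast<cudaStream_t>(0xdeadbeef);
  cuda::stream_ref ref{bogus};

  try
  {
    ref.wait(); // expected: an exception; observed: the process exits with code 0
  }
  catch (const std::exception& e)
  {
    std::printf("caught: %s\n", e.what());
    return 1;
  }
  std::printf("wait() returned normally\n");
  return 2;
}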

Describe the solution you'd like

Enable exceptions and find a proper way to "throw".

Describe alternatives you've considered

No response

Additional context

I'm not sure about the scope of this request; it could be a general CCCL feature request.

jrhemstad commented 2 months ago

Hmmm, this is surprising behavior to me.

It looks like we're defining _LIBCUDACXX_NO_EXCEPTIONS by default.

https://godbolt.org/z/b3dWsoTqa
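
For anyone who wants to check this locally, a minimal sketch (assuming `<cuda/std/version>` pulls in the libcu++ configuration macros):

#include <cuda/std/version>

#if defined(_LIBCUDACXX_NO_EXCEPTIONS)
#  pragma message("_LIBCUDACXX_NO_EXCEPTIONS is defined: libcu++ APIs terminate instead of throwing")
#else
#  pragma message("_LIBCUDACXX_NO_EXCEPTIONS is not defined: libcu++ APIs can throw")
#endif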

jrhemstad commented 2 months ago

There's a secondary issue that cuda::std::terminate shouldn't exit with exit code 0.

leofang commented 1 month ago

Preferably trap is not used in device code: https://github.com/NVIDIA/cccl/issues/939#issuecomment-1802542072

jrhemstad commented 1 month ago

Preferably trap is not used in device code: #939 (comment)

That's the only real option in device code that would otherwise throw an exception in host code.
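
To illustrate that split (a sketch only, not libcu++ source; the exception type below is a made-up stand-in), the error path of a __host__ __device__ helper could dispatch per target roughly like this:

#include <nv/target>
#include <cuda_runtime_api.h>
#include <stdexcept>

// Made-up stand-in for a CUDA error exception type (error code plus message).
struct cuda_error_placeholder : std::runtime_error
{
  cuda_error_placeholder(::cudaError_t, const char* msg) : std::runtime_error(msg) {}
};

// Sketch: throw on the host, trap on the device. With nvcc the device pass
// never sees the throw because NV_IF_ELSE_TARGET selects the branch per target.
__host__ __device__ inline void throw_or_trap(::cudaError_t err, const char* msg)
{
  NV_IF_ELSE_TARGET(NV_IS_HOST,
                    (throw cuda_error_placeholder(err, msg);),
                    ((void) err; (void) msg; __trap();))
}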

leofang commented 1 month ago

We need an alternative for Python because __trap corrupts the CUDA context and makes all subsequent CUDA operations fail. It is not a recoverable error, and for long-running applications (e.g. ML training, simulations, ...) or interactive sessions (e.g. Jupyter notebooks) it's simply too disruptive. For those (e.g. Google) who have a sophisticated error-mitigation, checkpointing, and/or fault-tolerance stack in place, it might not be a big deal, but such a stack is certainly not available to general users, and I would hope that libcudacxx takes this into account.

miscco commented 1 month ago

I would hope that libcudacxx takes this into account.

I am happy to get to a solution that works better for the Python folks, but I would need some suggestions on how you believe exceptions could be implemented.

leofang commented 1 month ago

Thanks, Michael! I don't believe there's any silver bullet here; all we can do is compromise. This is admittedly a particularly hard problem for libcudacxx, where there's no API that accepts an opaque library handle or descriptor offering a natural definition of a library "context/scope," since all C++ standard library APIs are stateless functions. A scoped object would allow systematic deviations from the default behavior as well as offer a customization entry point.
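
As one way to picture that "scoped object" idea (purely hypothetical; none of these names exist in libcu++), an RAII scope could install a per-thread error handler that library internals consult before falling back to the default terminate/throw path:

#include <cuda_runtime_api.h>
#include <functional>
#include <utility>

// Hypothetical handler signature: receives the CUDA error code and a message.
using cuda_error_handler = std::function<void(::cudaError_t, const char*)>;

// Hypothetical per-thread handler; empty means "use the default behavior".
inline thread_local cuda_error_handler g_cuda_error_handler;

// Hypothetical RAII scope: installs a handler and restores the previous one on
// destruction, giving a localized deviation from the default behavior.
class cuda_error_handler_scope
{
  cuda_error_handler previous_;

public:
  explicit cuda_error_handler_scope(cuda_error_handler h)
      : previous_(std::exchange(g_cuda_error_handler, std::move(h)))
  {}
  ~cuda_error_handler_scope() { g_cuda_error_handler = std::move(previous_); }
};

// The library's error path would then check the handler first, e.g.:
//   if (g_cuda_error_handler) { g_cuda_error_handler(err, msg); }
//   else                      { /* current terminate()/throw path */ }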

As a user who wants device-side error handling, I may want to do one or more of the following:

  1. Kernel input checks: Sometimes the validity checks of input parameters are too expensive to run in a separate kernel. We want a mega kernel that does both the input checks and,
    • if the checks pass, the actual computation, or
    • if any check fails, aborts the kernel and returns the error/control to users on the host
  2. Run-time checks: As the computation proceeds, we want to perform on-the-fly checks to ensure the problem state is still valid (usually such an "error" is not really fatal, just a means of returning control back to host-side handling)
  3. A way to do simple logging/printing: see, e.g., https://github.com/NVIDIA/cccl/issues/939, in particular https://github.com/NVIDIA/cccl/issues/939#issuecomment-2317484439.

If we could somehow allow users to pass in a host pinned buffer and define a custom operator that can be fused into any device function, such that libcudacxx writes to the buffer when a device-side error occurs and the user monitors the buffer for value changes from a host thread, that would be a good start. The trouble is that I can't think of a generic and robust way of doing so without the library handle mentioned earlier. I would hate to use a global config unless its visibility can somehow be localized to a single translation unit (again, a compromise).
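
That pinned-buffer pattern is not something libcu++ provides today; the following is a minimal sketch of it, assuming unified addressing is available (so the pinned host pointer is directly dereferenceable on the device) and that a plain flag write is sufficient for signalling. All function and variable names here are made up for illustration.

// Build with nvcc (C++11 or later). The device side records an error code in a
// host-pinned flag; a host thread polls the flag while the kernel runs.
#include <cuda_runtime_api.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Device side: record an error code in the host-pinned flag. A plain volatile
// write is used to sidestep questions about device atomics on mapped host
// memory; any nonzero value means "error".
__device__ void report_error(volatile int* status, int code)
{
  *status = code;
}

__global__ void checked_kernel(const float* in, int n, int* status)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (in[i] < 0.0f) { report_error(status, 1); return; } // "invalid input"
  // ... the actual computation would go here ...
}

int main()
{
  const int n = 1 << 20;
  std::vector<float> h_in(n, 1.0f);
  h_in[12345] = -1.0f; // plant an invalid value so the check fires

  float* d_in = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  // Host-pinned, device-mapped error flag.
  int* status = nullptr;
  cudaHostAlloc(&status, sizeof(int), cudaHostAllocMapped);
  *status = 0;

  checked_kernel<<<(n + 255) / 256, 256>>>(d_in, n, status);

  // Host-side monitor thread: poll the pinned flag while the kernel is in flight.
  std::atomic<bool> done{false};
  std::thread monitor([&] {
    volatile int* flag = status; // volatile read so the poll isn't optimized away
    while (!done.load())
    {
      if (*flag != 0) { std::printf("device reported error %d\n", *flag); break; }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });

  cudaDeviceSynchronize();
  done.store(true);
  monitor.join();

  int result = *status;
  cudaFreeHost(status);
  cudaFree(d_in);
  return result;
}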

miscco commented 1 month ago

I am working on getting assertions into libcu++.

Similar to what CUB has, we will also go in the direction of adding logging, etc., to the mix.