The current way of handling context is fundamentally incompatible with the Runtime API

RDambrosio016 commented 2 years ago

Small tracking issue for sorting out the context issues that are blocking cuBLAS and cuFFT work. The gist of it is that we currently use the "traditional" way of handling contexts in the driver API, which works like this:

Create a new context when needed; it is pushed onto a thread-local context stack in the driver API.
For multithreading, you obtain an unowned context and hand it to each thread, and each thread then sets it as its current context.
Dropping the context destroys all backing memory and resources; doing so while another thread is still using the context is UB (albeit extremely rare).
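To make the thread-local stack concrete, here is a minimal std-only Rust model (no CUDA involved). `CtxId`, `create_and_push`, `current` and `set_current` are hypothetical stand-ins that loosely mirror the driver's cuCtxCreate/cuCtxSetCurrent behaviour described above; a real context is an opaque `CUcontext`, not an integer.

```rust
use std::cell::RefCell;
use std::thread;

/// Hypothetical stand-in for a driver context handle.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CtxId(u32);

thread_local! {
    // Every thread gets its own, initially empty, context stack.
    static CTX_STACK: RefCell<Vec<CtxId>> = RefCell::new(Vec::new());
}

/// Models "create a context and push it": it becomes current on this thread only.
pub fn create_and_push(id: u32) -> CtxId {
    let ctx = CtxId(id);
    CTX_STACK.with(|s| s.borrow_mut().push(ctx));
    ctx
}

/// The context at the top of the calling thread's stack, if any.
pub fn current() -> Option<CtxId> {
    CTX_STACK.with(|s| s.borrow().last().copied())
}

/// Models "each thread sets the current context": replaces the top of this thread's stack.
pub fn set_current(ctx: CtxId) {
    CTX_STACK.with(|s| {
        let mut s = s.borrow_mut();
        s.pop();
        s.push(ctx);
    });
}

/// Creates a context on the calling thread, then reports what a new worker thread
/// sees before and after it explicitly sets the shared handle as current.
pub fn demo() -> (Option<CtxId>, Option<CtxId>, Option<CtxId>) {
    let ctx = create_and_push(1);
    let creator_sees = current();
    let (before, after) = thread::spawn(move || {
        let before = current(); // the worker's stack starts empty
        set_current(ctx);
        (before, current())
    })
    .join()
    .unwrap();
    (creator_sees, before, after)
}
```

Calling `demo()` shows that the creating thread sees the context immediately, while a worker sees nothing until it sets the shared handle as current itself.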

This is very different from what cudart does:

On any API call, cudart checks whether a context has been created; if not, it creates a new one.
This context is per-device and reference-counted.
Users can call cudaDeviceReset, which nukes the device and its primary context.
If the driver API has created a context and made it current, cudart picks it up and uses that one.
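A rough std-only model of the cudart behaviour listed above, assuming a single thread and integer context ids (`RuntimeModel` and its methods are hypothetical, not cudart internals): a context is created lazily per device, a context already made current through the driver API takes precedence, and a device reset forces a fresh context on the next call.

```rust
use std::collections::HashMap;

/// Hypothetical model of cudart's implicit context handling.
#[derive(Default)]
pub struct RuntimeModel {
    // One context per device, created lazily on first use.
    primary: HashMap<u32, u64>,
    // Number of contexts created so far; also used as the next id.
    pub created: u64,
}

impl RuntimeModel {
    /// The context an API call on `device` would use. `driver_current` is whatever
    /// context the driver API made current on this thread, if any.
    pub fn context_for_call(&mut self, device: u32, driver_current: Option<u64>) -> u64 {
        // A context made current via the driver API is picked up as-is.
        if let Some(ctx) = driver_current {
            return ctx;
        }
        // Otherwise reuse the device's context, creating it on first use.
        if let Some(&ctx) = self.primary.get(&device) {
            return ctx;
        }
        self.created += 1;
        self.primary.insert(device, self.created);
        self.created
    }

    /// Models cudaDeviceReset: forget the device's context so the next call makes a new one.
    pub fn device_reset(&mut self, device: u32) {
        self.primary.remove(&device);
    }
}
```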

This causes a good number of issues when trying to interoperate with cudart, and it is what causes the spurious segfaults in the cuBLAS code I just pushed. What I presume is happening:

The driver pushes a context before cudart is initialized.
cudart picks up that context.
The driver-side code does work with cuBLAS.
The driver drops the context, which presumably also nukes everything cudart and cuBLAS set up in it.
Something during process exit causes cudart/cuBLAS to try to use the now-invalid context, which segfaults.
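The presumed sequence can be replayed in a small std-only model. `DriverModel` and `LibHandle` are hypothetical: the "library" caches whatever context was current when it initialized, and where real exit-time cleanup would dereference a destroyed context and segfault, the model just reports it.

```rust
use std::collections::HashSet;

/// Hypothetical model of the driver's set of live contexts.
#[derive(Default)]
pub struct DriverModel {
    live: HashSet<u64>,
    next: u64,
}

impl DriverModel {
    pub fn create_and_push(&mut self) -> u64 {
        self.next += 1;
        self.live.insert(self.next);
        self.next
    }
    pub fn destroy(&mut self, ctx: u64) {
        self.live.remove(&ctx);
    }
    pub fn is_valid(&self, ctx: u64) -> bool {
        self.live.contains(&ctx)
    }
}

/// Stand-in for cudart/cuBLAS state that caches the context current at init time.
pub struct LibHandle {
    ctx: u64,
}

impl LibHandle {
    pub fn init(current: u64) -> Self {
        LibHandle { ctx: current }
    }
    /// Exit-time cleanup: real code would touch the dead context here; the model reports it.
    pub fn teardown(&self, driver: &DriverModel) -> Result<(), String> {
        if driver.is_valid(self.ctx) {
            Ok(())
        } else {
            Err(format!("context {} already destroyed", self.ctx))
        }
    }
}

pub fn presumed_sequence() -> Result<(), String> {
    let mut driver = DriverModel::default();
    let ctx = driver.create_and_push(); // driver pushes a context before cudart is initialized
    let lib = LibHandle::init(ctx); // cudart/cuBLAS pick up that context
    // ... cuBLAS work would happen here ...
    driver.destroy(ctx); // the driver-side context is dropped
    lib.teardown(&driver) // exit-time cleanup uses the now-invalid context
}
```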

However, the driver API also has primary context handling, which is basically what cudart does implicitly, made explicit:

cuDevicePrimaryCtxRetain retains the device's primary context and returns a handle to it; this context is reference-counted.
cuDevicePrimaryCtxRelease releases the handle back to the driver; if it was the last handle, the context is reset. Presumably, though, cudart holds on to its reference forever, so the context will never be reset unless that is done explicitly.
Retaining the primary context does not push it onto the context stack; it is essentially separate from the "normal" driver context handling.
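A sketch of the retain/release bookkeeping described above, as a hypothetical single-device model (`PrimaryCtxModel` is not the driver's code). `generation` changes whenever the context is recreated after the last release, which makes the reset observable.

```rust
/// Hypothetical model of one device's primary-context reference count.
#[derive(Default)]
pub struct PrimaryCtxModel {
    refcount: u32,
    // Bumped each time the context is (re)created after being reset.
    generation: u32,
}

impl PrimaryCtxModel {
    /// Like retain: create the context if nobody holds it, then bump the count.
    /// Nothing is pushed onto a context stack here.
    pub fn retain(&mut self) -> u32 {
        if self.refcount == 0 {
            self.generation += 1;
        }
        self.refcount += 1;
        self.generation
    }

    /// Like release: drop one reference; the last release resets the context.
    pub fn release(&mut self) {
        assert!(self.refcount > 0, "release without a matching retain");
        self.refcount -= 1;
    }

    pub fn is_active(&self) -> bool {
        self.refcount > 0
    }
}
```

If one holder (such as cudart) never releases, the count never reaches zero and the context is never reset, which matches the presumption above.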

So my proposal is:

Move the traditional context handling to cust::context::legacy, and keep the Context name (to avoid too much breakage) but switch it to use primary context handling.
Update the docs to say that the legacy way of handling contexts technically still works, but may cause a ton of issues when used with cudart or cuBLAS.
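One possible shape for the proposed layout, as a std-only sketch: the atomic counter in `Device` stands in for the driver's primary-context reference count, and the comments mark where the real cuDevicePrimaryCtxRetain/cuDevicePrimaryCtxRelease calls would go. The field and method names are hypothetical, not cust's actual API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical device whose counter stands in for the primary-context refcount.
#[derive(Default)]
pub struct Device {
    retains: AtomicUsize,
}

impl Device {
    pub fn retain_count(&self) -> usize {
        self.retains.load(Ordering::SeqCst)
    }
}

pub mod context {
    use super::Device;
    use std::sync::atomic::Ordering;
    use std::sync::Arc;

    /// Old behaviour, kept under `legacy`: each value is a separate, exclusively owned context.
    pub mod legacy {
        pub struct Context {
            pub id: u64,
        }
    }

    /// Proposed behaviour: `Context` keeps its name but wraps a retained primary context.
    pub struct Context {
        device: Arc<Device>,
    }

    impl Context {
        /// Would call cuDevicePrimaryCtxRetain in a real implementation.
        pub fn new(device: Arc<Device>) -> Self {
            device.retains.fetch_add(1, Ordering::SeqCst);
            Context { device }
        }
    }

    impl Drop for Context {
        /// Would call cuDevicePrimaryCtxRelease; other holders (e.g. cudart) keep it alive.
        fn drop(&mut self) {
            self.device.retains.fetch_sub(1, Ordering::SeqCst);
        }
    }
}
```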

This would have numerous benefits:

No more unsoundness from dropping a context while another thread is using it, because it is reference-counted.
No need for unowned contexts, again because of reference counting.
Generally better for performance, including doctests: creating many contexts murders performance and is usually unnecessary.
Should work seamlessly with cudart, since I presume cudart uses these same driver API functions under the hood.
Makes cust compatible with libraries like cuBLAS and cuFFT right off the bat, so users don't start on legacy versions of cust and end up with libraries that are incompatible with cuBLAS/cuFFT.
Creating a context is, for the most part, no longer a hugely expensive operation.
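The first two benefits come straight from reference counting. In this hypothetical std-only sketch, `Arc` plays the role of the retained primary context: each worker thread gets its own clone, so the main thread can drop its handle early without invalidating the context the workers are still using, and no separate unowned handle type is needed.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical primary-context state; real code would hold a retained CUcontext.
pub struct PrimaryCtx {
    pub device: u32,
}

/// Spawns `n` workers that each use their own clone of the context handle.
pub fn run_workers(n: u32) -> Vec<u32> {
    let ctx = Arc::new(PrimaryCtx { device: 0 });
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || ctx.device + i)
        })
        .collect();
    // The main thread gives up its reference early; the workers keep the context alive.
    drop(ctx);
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```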

However, it does keep one issue: if a user calls a device reset (cudaDeviceReset in cudart, or its driver API equivalent), nothing can do CUDA work with that context anymore. I don't think there is a way to fully solve this: with legacy context handling a user can do the same thing just by dropping the context, while with primary contexts they can call the device reset. So either way, a user can nuke CUDA contexts if they want to; the difference is that a device reset is more explicit and will probably be unsafe in cust.
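A small hypothetical model of why an explicit device reset would naturally be `unsafe`: handles remember which "generation" of the primary context they retained, and a reset silently invalidates all of them, including any held by cudart or cuBLAS, which the borrow checker cannot track. `DeviceState` and `CtxHandle` are illustrative names only.

```rust
/// Hypothetical per-device state; `generation` changes whenever the context is reset.
#[derive(Default)]
pub struct DeviceState {
    generation: u64,
}

/// A handle remembers which generation of the primary context it retained.
pub struct CtxHandle {
    generation: u64,
}

impl DeviceState {
    pub fn retain(&self) -> CtxHandle {
        CtxHandle { generation: self.generation }
    }

    pub fn is_valid(&self, h: &CtxHandle) -> bool {
        h.generation == self.generation
    }

    /// Sketch of an explicit device reset.
    ///
    /// # Safety
    /// Every handle retained before the reset becomes invalid; the caller must
    /// ensure none of them (including ones held by other libraries) is used afterwards.
    pub unsafe fn reset(&mut self) {
        self.generation += 1;
    }
}
```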

I will start working on this and will probably release these changes in cust 0.3.

RDambrosio016 commented 2 years ago

It doesn't seem possible to make this change without significant breakage: create_and_push will become new in the "new" context (same Context struct, just with different semantics).

RDambrosio016 commented 2 years ago

The new Context API has been implemented and will be part of cust/rust-cuda 0.3.

Rust-GPU / Rust-CUDA, issue #21