JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/
Other
1.2k stars 218 forks source link

Rework context handling #2346

Closed maleadt closed 5 months ago

maleadt commented 5 months ago

Problem

CUDA contexts are annoying:

All this significantly complicates our ability to determine whether objects are safe to use and/or need to be finalized. Currently, we solve this by using some kind of factory method that is guaranteed to return a unique context object for every session of a context handle (i.e., after handle destruction and recreation, this method returns a different object despite the handle being identical). In combination with targeted invalidation of that object from all known APIs that destroy a context, that makes it possible to automatically determine context validity in all derived objects storing a reference. All this relies on multiple global dictionaries, which are slow and fragile (and have resulted in several issues with use with threads, and in finalizers where we can't take locks to safely access that global dict). It also isn't guaranteed to be correct, especially when cooperating with other software that may call context APIs.

Solution

CUDA 12.0 provides a new driver API, cuCtxGetId, which returns a monotonically incrementing identifier that does change when a context is destroyed and re-allocated. This greatly simplifies the design:

This makes it possible to demote CuContext to a simple immutable type, and get rid of all context-related global state, improving thread- and finalizer-safety, while making it much cheaper to store context objects in derived resources.

The flip side: CUDA.jl will require a CUDA 12.x-compatible driver. This seems acceptable to me, given the improvements in this PR and the fact that CUDA 12 has been out for quite a while. People relying on CUDA 11.x can always keep using CUDA.jl 5.x. If needed, we can even make additional releases of CUDA.jl 5.x if backport PRs are suggested.

cc @vchuravy

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 79.24528% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 60.33%. Comparing base (5dd6bb2) to head (a534c10).

:exclamation: Current head a534c10 differs from pull request most recent head 51fb828. Consider uploading reports for the commit 51fb828 to get more accurate results

Files Patch % Lines
lib/cudadrv/stream.jl 42.85% 4 Missing :warning:
src/memory.jl 70.00% 3 Missing :warning:
lib/cudadrv/context.jl 92.85% 2 Missing :warning:
lib/utils/call.jl 0.00% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #2346 +/- ## ========================================== - Coverage 62.09% 60.33% -1.77% ========================================== Files 155 155 Lines 14965 14926 -39 ========================================== - Hits 9293 9005 -288 - Misses 5672 5921 +249 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

maleadt commented 5 months ago

Alternatively, maybe I should just get rid of the ability to reset contexts; this functionality isn't even handled properly by NVIDIA's own libraries...

maleadt commented 5 months ago

Or, even better, we could just only support resetting contexts on 12+. That should make it possible to keep compatibility with older drivers, as long as people don't reset the device.

maleadt commented 5 months ago

Alright, CUDA 11.x support is back. Let's merge this once CI is green.