NVIDIA / cuda-python

CUDA Python Low-level Bindings
https://nvidia.github.io/cuda-python/

Does CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM only affect the driver API? #61

Closed julee closed 2 weeks ago

julee commented 3 months ago

Version 11.6.0 added this environment variable. According to the documentation, setting it to 1 makes the default stream a per-thread default stream.
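For reference, a minimal sketch of how the variable would typically be set from Python. It assumes the bindings read the variable once, when they are first imported, so it is set before the import; the `bindings_available` flag is only there so the snippet also runs on a machine without cuda-python installed.

```python
import os

# Assumption: the bindings read this variable at import time, so it has to
# be set before the first `from cuda import ...` in the process.
os.environ["CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

try:
    from cuda import cuda  # driver API bindings
    bindings_available = True
except ImportError:
    # cuda-python is not installed; nothing else to do in this sketch.
    bindings_available = False
```

Exporting the variable in the shell before launching Python has the same effect and avoids any dependence on import order.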

However, in the code, this variable mainly controls whether the ptds/ptsz-suffixed versions are used when loading driver API symbols. Runtime API symbols appear to be linked directly and so would not be affected.
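To make the suffix mechanism concrete, here is a rough sketch of how a loader could pick the symbol name. The table and both helper functions are hypothetical and not taken from cuda-python; only the suffixed names themselves (`cuMemcpy_ptds`, `cuMemcpyAsync_ptsz`, `cuLaunchKernel_ptsz`) are the per-thread variants that the driver headers define. How the real loader parses the variable is also an assumption here.

```python
import os

# Hypothetical table: which per-thread suffix each driver entry point takes.
# "_ptds" is used for calls that work on the default stream implicitly,
# "_ptsz" for calls that take an explicit stream argument.
PER_THREAD_SUFFIX = {
    "cuMemcpy": "_ptds",
    "cuMemcpyAsync": "_ptsz",
    "cuLaunchKernel": "_ptsz",
}

def use_per_thread_default_stream(environ=os.environ):
    """Return True when the variable is set to "1" (assumed parsing rule)."""
    return environ.get("CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM", "0") == "1"

def driver_symbol_name(base, environ=os.environ):
    """Pick the symbol name a loader would look up for a driver function."""
    if use_per_thread_default_stream(environ):
        # Functions without a per-thread variant keep their plain name.
        return base + PER_THREAD_SUFFIX.get(base, "")
    return base
```

With the variable set to 1, `driver_symbol_name("cuMemcpyAsync")` yields `cuMemcpyAsync_ptsz`; with it unset, the plain `cuMemcpyAsync` name is used.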

So, does this environment variable only affect the driver API?

If so, this should be explained in the documentation.

leofang commented 3 weeks ago

Hi @julee, sorry for the late reply. The env var also affects the runtime APIs, because we currently re-implement cudart on top of the driver APIs; the cudart symbols are not directly linked. Let us know if you encounter any unexpected behavior.

julee commented 2 weeks ago

> Hi @julee, sorry for late reply. The env var also affects the runtime APIs because currently we re-implement cudart using the driver APIs. The cudart symbols are not directly linked. Let us know if you encounter any unexpected behavior.

I am not currently running into any issues in development; I just had some questions about the behavior of CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM while reviewing the documentation and code.

Additionally, I could not find the cudart re-implementation in the code, and it is not mentioned in the documentation. Is it closed source?

leofang commented 2 weeks ago

> Additionally, I have not found any re-implementation of cudart code, nor is it mentioned in the documentation, so is it closed source?

All existing code is source-available on GitHub (under an NVIDIA license). Whenever you see a CUDA runtime API (prefix cuda) internally call a CUDA driver API (prefix cu), it is a re-implementation on top of the driver. For example, here is the re-implementation of cudaMalloc: https://github.com/NVIDIA/cuda-python/blob/6044e4e26f286b6ff2ce9d55c43403459d7726f0/cuda/_lib/ccudart/ccudart.pyx.in#L3091-L3099
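The layering can be pictured with a toy sketch like the one below. It is not the code at the link above: both functions are stand-ins written for illustration, no GPU is touched, and the real implementation does additional work. It only shows the shape of a runtime-style `cudaMalloc` that delegates to a driver-style `cuMemAlloc` and maps the status code.

```python
# Toy stand-ins only: none of this is cuda-python code.
CUDA_SUCCESS = 0               # driver-style status codes
CUDA_ERROR_INVALID_VALUE = 1
cudaSuccess = 0                # runtime-style status codes
cudaErrorInvalidValue = 1

_next_fake_address = 0x1000    # pretend device address space

def cuMemAlloc(size):
    """Fake driver call: returns (status, device_pointer)."""
    global _next_fake_address
    if size <= 0:
        return CUDA_ERROR_INVALID_VALUE, 0
    ptr = _next_fake_address
    _next_fake_address += size
    return CUDA_SUCCESS, ptr

def cudaMalloc(size):
    """Runtime-style wrapper implemented on top of the fake driver call."""
    status, ptr = cuMemAlloc(size)
    if status != CUDA_SUCCESS:
        # Translate the driver status into a runtime status.
        return cudaErrorInvalidValue, 0
    return cudaSuccess, ptr
```

Because every runtime-style call in this arrangement ends in a driver call, whichever driver symbol was loaded (plain or ptds/ptsz) decides the stream behavior of the runtime call as well.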