szmigacz opened 1 month ago
The root cause is that torch.compile initializes CUDA in https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/jit.py. If I run with NVTE_TORCH_COMPILE=0 set in the environment (to use nvFuser instead of torch.compile), importing TE doesn't initialize CUDA.
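A quick way to confirm this from the command line (a sketch, assuming torch and transformer_engine are installed; `transformer_engine.pytorch` is TE's PyTorch entry point):

```shell
# Expect False if the import no longer touches CUDA when NVTE_TORCH_COMPILE=0
NVTE_TORCH_COMPILE=0 python -c "import transformer_engine.pytorch; import torch; print(torch.cuda.is_initialized())"
```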
Possible next steps:
- Run with NVTE_TORCH_COMPILE=0 as a workaround.
Import alone shouldn't initialize CUDA. If CUDA is already initialized when a custom subprocess launcher forks, the child process fails with `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`.
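The fork-vs-spawn failure mode can be sketched with the standard library alone (no CUDA needed; the module-level global below stands in for an initialized CUDA context):

```python
import multiprocessing as mp
import os

# Simulates library state that gets initialized in the parent process
# (analogous to a CUDA context created at import time).
_initialized_in = None

def init_context():
    global _initialized_in
    _initialized_in = os.getpid()

def worker(q):
    # With 'fork', the child inherits _initialized_in from the parent,
    # just as it would inherit an initialized (and now invalid) CUDA context.
    # With 'spawn', the child re-imports the module fresh, so it sees None.
    q.put(_initialized_in)

if __name__ == "__main__":
    init_context()  # parent "initializes CUDA"
    for method in ("fork", "spawn"):
        ctx = mp.get_context(method)
        q = ctx.Queue()
        p = ctx.Process(target=worker, args=(q,))
        p.start()
        print(method, q.get())  # fork: parent's pid; spawn: None
        p.join()
```

This is why PyTorch refuses to continue after a fork: the inherited CUDA state cannot be safely reused, whereas a spawned child starts from a clean import.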