NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

import transformer_engine initializes CUDA #872

Open szmigacz opened 1 month ago

szmigacz commented 1 month ago
```python
>>> import torch
>>> torch.cuda.is_initialized()
False
>>> import transformer_engine
>>> torch.cuda.is_initialized()
True
```

Importing the package alone shouldn't initialize CUDA. Once CUDA is initialized, custom subprocess launchers that fork will fail with `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`.
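For context, a minimal sketch of the launcher pattern the error message points at: using the `spawn` start method so child processes start from a fresh interpreter instead of inheriting the parent's (possibly CUDA-initialized) state via `fork`. The `double`/`run_pool` names are illustrative, not part of TE or PyTorch.

```python
import multiprocessing as mp

def double(x):
    # Stand-in for real per-process work; must be module-level so it
    # can be pickled and sent to spawned children.
    return x * 2

def run_pool():
    # "spawn" starts fresh interpreters, so children do not inherit the
    # parent's CUDA state (unlike the default "fork" on Linux), which
    # avoids the "Cannot re-initialize CUDA in forked subprocess" error.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(double, [1, 2, 3])

if __name__ == "__main__":
    print(run_pool())
```

The guard around the call is required with `spawn`, since each child re-imports the main module and would otherwise recursively create pools.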

timmoon10 commented 1 month ago

The root cause is that torch.compile initializes CUDA in https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/jit.py. If I run with NVTE_TORCH_COMPILE=0 set in the environment (so TE uses nvFuser instead of torch.compile), importing TE doesn't initialize CUDA.
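A one-line way to reproduce the workaround described above. This assumes, per the comment, that TE reads NVTE_TORCH_COMPILE at import time, so the variable must be set before the interpreter starts (or before the import executes).

```shell
# Disable torch.compile in TE (falling back to nvFuser) so that
# importing transformer_engine does not eagerly initialize CUDA.
NVTE_TORCH_COMPILE=0 python -c "import torch, transformer_engine; print(torch.cuda.is_initialized())"
# Per the comment above, this is expected to print False.
```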

Possible next steps: