szmigacz opened 1 month ago
The root cause is that torch.compile initializes CUDA in https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/jit.py. If I run with NVTE_TORCH_COMPILE=0 set in the environment (to use nvFuser instead of torch.compile), importing TE doesn't initialize CUDA.
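A quick way to confirm this from the command line (a sketch, assuming torch and transformer_engine are installed; `transformer_engine.pytorch` is TE's PyTorch entry point):

```shell
# Expect False if the import no longer touches CUDA when NVTE_TORCH_COMPILE=0
NVTE_TORCH_COMPILE=0 python -c "import transformer_engine.pytorch; import torch; print(torch.cuda.is_initialized())"
```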
Possible next steps:
- Run with NVTE_TORCH_COMPILE=0 as a workaround.
Import alone shouldn't initialize CUDA. If CUDA is already initialized when a custom subprocess launcher forks, the child process fails with `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`.
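The fork-vs-spawn failure mode can be sketched with the standard library alone (no CUDA needed; the module-level global below stands in for an initialized CUDA context):

```python
import multiprocessing as mp
import os

# Simulates library state that gets initialized in the parent process
# (analogous to a CUDA context created at import time).
_initialized_in = None

def init_context():
    global _initialized_in
    _initialized_in = os.getpid()

def worker(q):
    # With 'fork', the child inherits _initialized_in from the parent,
    # just as it would inherit an initialized (and now invalid) CUDA context.
    # With 'spawn', the child re-imports the module fresh, so it sees None.
    q.put(_initialized_in)

if __name__ == "__main__":
    init_context()  # parent "initializes CUDA"
    for method in ("fork", "spawn"):
        ctx = mp.get_context(method)
        q = ctx.Queue()
        p = ctx.Process(target=worker, args=(q,))
        p.start()
        print(method, q.get())  # fork: parent's pid; spawn: None
        p.join()
```

This is why PyTorch refuses to continue after a fork: the inherited CUDA state cannot be safely reused, whereas a spawned child starts from a clean import.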