NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Ubuntu session close during building wheel step #873

Closed Ciclarion closed 1 month ago

Ciclarion commented 1 month ago

Hello,

My system: Ubuntu 24.04, CUDA 12.1, cuDNN 8.9.2, Python 3.10

I have quite a strange problem. When I try to install TransformerEngine with pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable, my computer "crashes" during the build_wheel step: my Ubuntu session closes automatically!

I also tried building from source, and it behaves the same way while running setup.py.

Because it crashes, I get no error message, so I have no idea what the problem could be.

ptrendx commented 1 month ago

Most probably you are hitting an OOM (out-of-memory) error. Could you try changing the threads value in this line: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/CMakeLists.txt#L17 from 4 to 1? The build should take longer but use less memory.

Ciclarion commented 1 month ago

Thanks for the quick answer. I tried that change, but sadly I still get the same problem. However, I was monitoring GPU usage with nvidia-smi, and it didn't seem to grow at all during the build. (For information, I have one RTX 3090.) I'll try to see what else I can change.

Edit: After checking, the generated CMakeCache.txt contains the line "CMAKE_CUDA_FLAGS:STRING= " with nothing after it. I don't know if that's normal.
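A note on the nvidia-smi observation above: nvidia-smi reports GPU memory, whereas compiling CUDA sources consumes host RAM, so an out-of-memory condition during the build would not show up there. A minimal sketch for checking host memory on Linux while the build runs, assuming /proc/meminfo is available (the function names are illustrative, not part of any tool mentioned in this thread):

```python
# Sketch: report available host RAM and free swap from /proc/meminfo (Linux only).
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Return {field: value} for lines such as 'MemAvailable:  1234 kB'."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines whose first token after the colon is a number.
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(meminfo: dict[str, int]) -> float:
    """Convert the MemAvailable field (reported in KiB) to GiB."""
    return meminfo["MemAvailable"] / (1024 ** 2)


if __name__ == "__main__":
    path = Path("/proc/meminfo")
    if path.exists():
        info = parse_meminfo(path.read_text())
        print(f"available host RAM: {available_gib(info):.1f} GiB")
        print(f"free swap: {info.get('SwapFree', 0) / 1024 ** 2:.1f} GiB")
```

Running this in a second terminal a few times during the build_wheel step shows whether available host memory is heading toward zero before the session closes.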

Ciclarion commented 1 month ago

It was indeed an OOM error, and I had to change the MAX_NUM_WORK environment variable for the ninja build!
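For anyone landing here with the same symptom, the fix amounts to running the install with a lower parallel job count in the environment. A minimal sketch, with two assumptions: the variable name MAX_NUM_WORK is copied verbatim from the comment above and has not been checked against the build scripts (builds driven by PyTorch's torch.utils.cpp_extension read MAX_JOBS, which may be the one that applies), and the helper names are hypothetical.

```python
# Sketch: build an environment with a reduced job count for the pip install.
import os
import sys

# Assumption: name taken from the comment above; verify which variable
# your TransformerEngine version actually reads before relying on it.
JOBS_VAR = "MAX_NUM_WORK"

PACKAGE = "git+https://github.com/NVIDIA/TransformerEngine.git@stable"


def limited_build_env(jobs: int, var_name: str = JOBS_VAR, base=None) -> dict[str, str]:
    """Copy the environment and set the job-count variable to `jobs`."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    env = dict(os.environ if base is None else base)
    env[var_name] = str(jobs)
    return env


def pip_install_command(package: str = PACKAGE) -> list[str]:
    """Command line that installs `package` with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    env = limited_build_env(jobs=1)
    cmd = pip_install_command()
    print(f"{JOBS_VAR}={env[JOBS_VAR]}", " ".join(cmd))
    # To actually run the build, uncomment the next two lines:
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```

Starting at one job and raising the count while watching host memory is the cautious way to find a value the machine can handle.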