NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Can't install: fatal error: <path-to-conda-env>/lib/python3.8/site-packages/torch/include/ATen/ops/argmax.h: No such file or directory #933

Closed by saurabh-kataria 1 week ago

saurabh-kataria commented 2 weeks ago

Every time I try to compile from source, I get an error of this kind: fatal error: <path-to-conda-env>/lib/python3.8/site-packages/torch/include/ATen/ops/argmax.h: No such file or directory. Each time, a different file in PyTorch's ATen folder is reported as missing, even though the file is actually there.

If I reinstall PyTorch (compiled from source, nightly, CUDA 12.4), that error disappears but the following one appears: /home/skatar6/TransformerEngine/transformer_engine/common/util/../common.h:20:10: fatal error: cuda_fp8.h: No such file or directory

timmoon10 commented 2 weeks ago

Can you check your CUDA version (e.g. with nvcc --version)? TE requires CUDA 11.8 or newer, which includes cuda_fp8.h.
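
A minimal sketch of that check, for readers who want to script it. The helper names `has_fp8_header` and `cuda_version_ok` are hypothetical, and the sketch assumes the usual toolkit layout in which nvcc lives in <cuda-root>/bin and headers in <cuda-root>/include; adjust it if your installation differs.

```shell
# Hypothetical helper: succeeds if the CUDA root "$1" contains cuda_fp8.h.
has_fp8_header() {
    [ -f "$1/include/cuda_fp8.h" ]
}

# Hypothetical helper: succeeds if version "$1" (e.g. 12.4) is at least 11.8.
cuda_version_ok() {
    cv_major=${1%%.*}
    cv_minor=${1#*.}
    cv_minor=${cv_minor%%.*}
    [ "$cv_major" -gt 11 ] || { [ "$cv_major" -eq 11 ] && [ "$cv_minor" -ge 8 ]; }
}

if command -v nvcc >/dev/null 2>&1; then
    nvcc_path=$(command -v nvcc)
    # nvcc reports its version on a line containing "release <major>.<minor>".
    cuda_version=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')
    # Assumption: nvcc sits in <cuda-root>/bin, so the root is two levels up.
    cuda_root=$(dirname "$(dirname "$nvcc_path")")
    echo "nvcc: $nvcc_path (release ${cuda_version:-unknown})"
    if [ -n "$cuda_version" ] && cuda_version_ok "$cuda_version"; then
        echo "CUDA version is 11.8 or newer"
    else
        echo "CUDA version is older than 11.8 or could not be parsed"
    fi
    if has_fp8_header "$cuda_root"; then
        echo "found $cuda_root/include/cuda_fp8.h"
    else
        echo "cuda_fp8.h not found under $cuda_root/include"
    fi
else
    echo "nvcc not found on PATH"
fi
```

This only inspects the nvcc that is first on PATH. The build may still pick up a different toolkit, so a clean result here does not rule out a detection problem.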

If your CUDA version is supported, I suspect the problem is from CMake. If you have multiple CUDA installations on your system, it might be detecting and using an old version. Try following one of these instructions to force it to use the right CUDA version. It may be helpful to pass the --verbose flag to pip install so that you can see the CMake build logs.
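
One way to pin the toolkit, sketched under assumptions: the helper name `use_cuda` and the example path are hypothetical, CUDACXX is the environment variable CMake consults when choosing a CUDA compiler, and CUDA_HOME is the variable the working command later in this thread passes to the build. Whether this is enough depends on your setup, so check the verbose build log to confirm which nvcc was actually used.

```shell
# Hypothetical helper: point the build at one specific CUDA toolkit root.
use_cuda() {
    if [ ! -x "$1/bin/nvcc" ]; then
        echo "no executable nvcc under $1/bin" >&2
        return 1
    fi
    export CUDA_HOME="$1"
    export CUDACXX="$1/bin/nvcc"   # consulted by CMake when it selects the CUDA compiler
    export PATH="$1/bin:$PATH"     # makes this nvcc the first one found on PATH
}

# Example usage (the toolkit path is an assumption; substitute your own):
# use_cuda /usr/local/cuda-12.4 && \
#     pip install --verbose git+https://github.com/NVIDIA/TransformerEngine.git@stable
```

The function refuses to change the environment when the given directory has no nvcc, which avoids silently exporting a wrong CUDA_HOME.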

saurabh-kataria commented 1 week ago

Thanks for the useful links. I think my machine had several CUDA, cuDNN, PyTorch, and disk issues. I reinstalled everything from scratch and, after much experimentation, the following commands finally installed the package:

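# Keep temporary and build files under the home directory.
export TMPDIR=/home/$USER/tmp && export CMAKE_TEMP_DIR=/home/$USER/tmp && export BUILD_DIR=/home/$USER/tmp/build
mkdir -p "$TMPDIR" && mkdir -p "$CMAKE_TEMP_DIR" && mkdir -p "$BUILD_DIR"
# Build with a single job, passing the CUDA, cuDNN, and compiler settings through explicitly.
# The backslash must be the last character on its line for the continuation to work.
TMP_DIR=/home/$USER/tmp MAX_JOBS=1 CUDA_HOME=$CUDA_HOME CUDNN_PATH=$CUDNN_PATH CC=$CC CXX=$CXX \
    pip -v install --no-deps --cache-dir /home/$USER/tmp/pip-cache git+https://github.com/NVIDIA/TransformerEngine.git@stable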