Closed saurabh-kataria closed 1 week ago
Can you check your CUDA version (e.g. with `nvcc --version`)? TE requires CUDA 11.8 or newer, which includes `cuda_fp8.h`.
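If it helps, the version check can be scripted. The following is only a minimal sketch: it assumes `nvcc --version` prints a line containing `release <major>.<minor>` (as recent toolkits do), and the helper names `parse_nvcc_release` and `cuda_is_new_enough` are made up for this example; they are not part of TE or CUDA.

```python
import re
import shutil
import subprocess

MIN_CUDA = (11, 8)  # minimum CUDA version quoted in this thread


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_is_new_enough(version, minimum=MIN_CUDA):
    """Compare version tuples; an unknown version (None) counts as not OK."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")  # the nvcc that comes first on PATH
    if nvcc is None:
        print("nvcc not found on PATH")
    else:
        out = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        ).stdout
        version = parse_nvcc_release(out)
        verdict = "OK" if cuda_is_new_enough(version) else "too old or unknown"
        print(nvcc, version, verdict)
```

Note that this only reports the `nvcc` that is first on `PATH`, which is not necessarily the toolkit that CMake ends up using.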
If your CUDA version is supported, I suspect the problem is from CMake. If you have multiple CUDA installations on your system, it might be detecting and using an old version. Try following one of these instructions to force it to use the right CUDA version. It may be helpful to pass the `--verbose` flag to `pip install` so that you can see the CMake build logs.
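To see whether several toolkits are installed side by side, something like the sketch below may help. It assumes the toolkits live in directories named `cuda*` under `/usr/local`, which is a common layout but not guaranteed on your machine, and that headers sit in `<toolkit>/include`. The function name `find_cuda_installs` is hypothetical.

```python
from pathlib import Path


def find_cuda_installs(root="/usr/local"):
    """Return [(path, has_fp8_header)] for each cuda* directory under root."""
    results = []
    for candidate in sorted(Path(root).glob("cuda*")):
        if not candidate.is_dir():
            continue  # skip stray files that happen to match the pattern
        header = candidate / "include" / "cuda_fp8.h"
        results.append((str(candidate), header.is_file()))
    return results


if __name__ == "__main__":
    # Assumed default location; pass a different root if yours differs.
    for path, has_header in find_cuda_installs():
        status = "found" if has_header else "missing"
        print(f"{path}: cuda_fp8.h {status}")
```

A directory reported as `missing` is probably older than CUDA 11.8. Pointing `CUDA_HOME` at a directory that has the header, as the working command later in this thread does, is one way to steer the build toward it.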
Thanks for the useful links. I think my machine had several CUDA, cuDNN, PyTorch, and disk issues. I reinstalled everything from scratch, and after much experimentation the following commands finally installed the package:
export TMPDIR=/home/$USER/tmp && export CMAKE_TEMP_DIR=/home/$USER/tmp && export BUILD_DIR=/home/$USER/tmp/build
mkdir -p $TMPDIR && mkdir -p $CMAKE_TEMP_DIR && mkdir -p $BUILD_DIR
TMP_DIR=/home/$USER/tmp MAX_JOBS=1 CUDA_HOME=$CUDA_HOME CUDNN_PATH=$CUDNN_PATH CC=$CC CXX=$CXX \
pip -v install --no-deps --cache-dir /home/$USER/tmp/pip-cache git+https://github.com/NVIDIA/TransformerEngine.git@stable
Every time I try to compile from source, I get this kind of error:
fatal error: <path-to-conda-env>/lib/python3.8/site-packages/torch/include/ATen/ops/argmax.h: No such file or directory
Every time, it reports a different file in PyTorch's ATen folder as missing, even though the file is there. If I reinstall PyTorch (compiling the nightly CUDA 12.4 version from source), this error disappears, but the following one appears:
/home/skatar6/TransformerEngine/transformer_engine/common/util/../common.h:20:10: fatal error: cuda_fp8.h: No such file or directory