NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Can't find `nvToolsExt` during build #879

Open kvablack opened 6 months ago

kvablack commented 6 months ago

Hi, I'm trying to install TransformerEngine for JAX. I prefer to install cuda-toolkit via conda, and it seems like most CUDA libraries (e.g., cuDNN) are being found, but the install fails because CMake cannot find nvToolsExt:

-- Found Threads: TRUE
      -- cudnn found at /home/black/miniforge3/envs/monopi/lib/libcudnn.so.
      CMake Warning (dev) at /home/black/miniforge3/envs/monopi/share/cmake-3.29/Modules/FindPackageHandleStandardArgs.cmake:438 (message):
        The package name passed to `find_package_handle_standard_args` (LIBRARY)
        does not match the name of the calling package (CUDNN).  This can lead to
        problems in calling code that expects `find_package` result variables
        (e.g., `_FOUND`) to follow a certain pattern.
      Call Stack (most recent call first):
        cmake/FindCUDNN.cmake:44 (find_package_handle_standard_args)
        CMakeLists.txt:24 (find_package)
      This warning is for project developers.  Use -Wno-dev to suppress it.

      -- Found LIBRARY: /home/black/miniforge3/envs/monopi/include
      -- cuDNN: /home/black/miniforge3/envs/monopi/lib/libcudnn.so
      -- cuDNN: /home/black/miniforge3/envs/monopi/include
      -- cudnn_adv_infer found at /home/black/miniforge3/envs/monopi/lib/libcudnn_adv_infer.so.
      -- cudnn_adv_train found at /home/black/miniforge3/envs/monopi/lib/libcudnn_adv_train.so.
      -- cudnn_cnn_infer found at /home/black/miniforge3/envs/monopi/lib/libcudnn_cnn_infer.so.
      -- cudnn_cnn_train found at /home/black/miniforge3/envs/monopi/lib/libcudnn_cnn_train.so.
      -- cudnn_ops_infer found at /home/black/miniforge3/envs/monopi/lib/libcudnn_ops_infer.so.
      -- cudnn_ops_train found at /home/black/miniforge3/envs/monopi/lib/libcudnn_ops_train.so.
      -- Found Python: /home/black/miniforge3/envs/monopi/bin/python3.10 (found version "3.10.14") found components: Interpreter Development.Module
      -- JAX support: ON
      -- Performing Test HAS_FLTO
      -- Performing Test HAS_FLTO - Success
      -- Found pybind11: /tmp/pip-req-build-d2rbhz82/.eggs/pybind11-2.12.0-py3.10.egg/pybind11/include (found version "2.12.0")
      -- Configuring done (1.9s)
      CMake Error at common/CMakeLists.txt:54 (target_link_libraries):
        Target "transformer_engine" links to:

          CUDA::nvToolsExt

        but the target was not found.  Possible reasons include:

          * There is a typo in the target name.
          * A find_package call is missing for an IMPORTED target.
          * An ALIAS target is missing.

Even though it exists in lib/:

❯ find ~/miniforge3 -name "*nvToolsExt*"
/home/black/miniforge3/envs/monopi/targets/x86_64-linux/lib/libnvToolsExt.so.1
/home/black/miniforge3/envs/monopi/targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0
/home/black/miniforge3/envs/monopi/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h
/home/black/miniforge3/envs/monopi/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h
/home/black/miniforge3/envs/monopi/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h
/home/black/miniforge3/envs/monopi/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h
/home/black/miniforge3/envs/monopi/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h
/home/black/miniforge3/envs/monopi/lib/libnvToolsExt.so.1
/home/black/miniforge3/envs/monopi/lib/libnvToolsExt.so.1.0.0
/home/black/miniforge3/pkgs/nsight-compute-2024.1.1.4-0/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h
/home/black/miniforge3/pkgs/nsight-compute-2024.1.1.4-0/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h
/home/black/miniforge3/pkgs/nsight-compute-2024.1.1.4-0/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h
/home/black/miniforge3/pkgs/nsight-compute-2024.1.1.4-0/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h
/home/black/miniforge3/pkgs/nsight-compute-2024.1.1.4-0/nsight-compute/2024.1.1/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h
/home/black/miniforge3/pkgs/cuda-nvtx-12.1.105-h59595ed_0/targets/x86_64-linux/lib/libnvToolsExt.so.1
/home/black/miniforge3/pkgs/cuda-nvtx-12.1.105-h59595ed_0/targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0
/home/black/miniforge3/pkgs/cuda-nvtx-12.1.105-h59595ed_0/lib/libnvToolsExt.so.1
/home/black/miniforge3/pkgs/cuda-nvtx-12.1.105-h59595ed_0/lib/libnvToolsExt.so.1.0.0
/home/black/miniforge3/pkgs/cuda-nvtx-12.5.39-he02047a_0/targets/x86_64-linux/lib/libnvToolsExt.so.1
/home/black/miniforge3/pkgs/cuda-nvtx-12.5.39-he02047a_0/targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0
/home/black/miniforge3/pkgs/cuda-nvtx-12.5.39-he02047a_0/lib/libnvToolsExt.so.1
/home/black/miniforge3/pkgs/cuda-nvtx-12.5.39-he02047a_0/lib/libnvToolsExt.so.1.0.0
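
Every library hit in the listing above is a versioned file (`libnvToolsExt.so.1`, `libnvToolsExt.so.1.0.0`); no unversioned `libnvToolsExt.so` appears. One possible explanation, not confirmed in this thread, is that CMake's library search looks for the unversioned name and so never defines the `CUDA::nvToolsExt` target. A minimal sketch for checking a library directory follows; `check_nvtx_libs` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: report which nvToolsExt library names exist in a directory.
# Returns 0 if the unversioned name exists, 1 if only versioned files exist,
# and 2 if nothing matching is found.
check_nvtx_libs() {
    libdir="$1"
    if [ -e "$libdir/libnvToolsExt.so" ]; then
        echo "unversioned libnvToolsExt.so present in $libdir"
        return 0
    fi
    if ls "$libdir"/libnvToolsExt.so.* >/dev/null 2>&1; then
        echo "only versioned libnvToolsExt.so.* present in $libdir"
        return 1
    fi
    echo "no libnvToolsExt found in $libdir"
    return 2
}
```

For example, `check_nvtx_libs "$CONDA_PREFIX/lib"` inspects the active conda environment, assuming `CONDA_PREFIX` is set by `conda activate`.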

timmoon10 commented 5 months ago

CUDA::nvToolsExt has been deprecated since CMake 3.25, but I don't see any indication that it has been removed. You're building with CMake 3.29, and I frequently build with CMake 3.29.5 without problems, so I wonder if there's some other difference in our build environments.

If the deprecation of CUDA::nvToolsExt is actually the root cause, the fix should just be switching to the CUDA::nvtx3 target. Can you try building with https://github.com/NVIDIA/TransformerEngine/pull/943?
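
Since the deprecation landed in CMake 3.25, the same release that introduced `CUDA::nvtx3`, it can help to confirm which CMake version a build environment actually picks up. The sketch below compares version strings; `version_ge` and `nvtx_target_for` are hypothetical helper names, and the comparison assumes GNU `sort -V` is available.

```shell
# Hypothetical helper: succeed if version $1 >= version $2 (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: print the NVTX imported target that is preferred
# for a given CMake version (nvtx3 from 3.25 onward, nvToolsExt before that).
nvtx_target_for() {
    if version_ge "$1" "3.25"; then
        echo "CUDA::nvtx3"
    else
        echo "CUDA::nvToolsExt"
    fi
}
```

A possible invocation is `nvtx_target_for "$(cmake --version | head -n 1 | awk '{print $3}')"`, which feeds in the version reported by whichever `cmake` is first on `PATH`.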

zlsh80826 commented 4 months ago

Hi @kvablack,

I am not very familiar with Conda, but I suspect the problem is that CMake is not aware of the installed CUDA root path. Could you try setting the environment variable CUDA_PATH=/home/black/miniforge3/pkgs/cuda-nvtx-12.1.105-h59595ed_0 (reference) before installing TE?
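
The suggestion above can be sketched as a small shell helper that picks a CUDA root before the install step. `resolve_cuda_root` is a hypothetical name, and falling back to `CONDA_PREFIX` is an assumption of this sketch; the thread does not settle whether the environment prefix or the package directory is the right root.

```shell
# Hypothetical helper: choose a CUDA root for the build.
# Prefers an explicit CUDA_PATH, then the active conda environment prefix
# (the CONDA_PREFIX fallback is an assumption, not something the thread confirms).
resolve_cuda_root() {
    if [ -n "${CUDA_PATH:-}" ]; then
        echo "$CUDA_PATH"
    elif [ -n "${CONDA_PREFIX:-}" ]; then
        echo "$CONDA_PREFIX"
    else
        echo "no CUDA root candidate found" >&2
        return 1
    fi
}
```

It could then be used as `export CUDA_PATH="$(resolve_cuda_root)"` before running the usual install command in the same shell, so that the build process inherits the variable.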