conda-forge / kokkos-feedstock

A conda-smithy repository for kokkos.
BSD 3-Clause "New" or "Revised" License

Paths and build targets with CUDA backend #21

Closed vincentmr closed 1 year ago

vincentmr commented 1 year ago

Solution to issue cannot be found in the documentation.

Issue

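I tried installing Lightning-Kokkos (L-Kokkos) on top of Kokkos (with the CUDA backend). I'm using Perlmutter, a GPU cluster at NERSC. Using CUDA-12 (not officially supported by L-Kokkos), I ran into the following issues:

  • The compiler x86_64-conda-linux-gnu-c++ is set as the Kokkos_CXX_COMPILER. On Perlmutter, the standard C++ compiler is CC. One must modify lib/cmake/Kokkos/KokkosConfigCommon.cmake and bin/kokkos_launch_compiler accordingly.
  • The targets CUDA::cudart, CUDA::cuda_driver have unpatched INTERFACE_INCLUDE_DIRECTORIES still pointing to /home/conda/feedstock_root/build_artifacts/kokkos_1687386826792/_build_env/targets/x86_64-linux/include.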

Using CUDA-11, I met the following issues

After these fixes, everything compiles, but I get an error

E   RuntimeError: Kokkos::Impl::ParallelReduce< Cuda > requested too much L0 scratch memory

To get past this, I installed Kokkos from source. Trying to execute on Perlmutter's A100 GPUs, I then get errors like

Kokkos::Cuda::initialize ERROR: running kernels compiled for compute capability 7.0 on device with compute capability 8.0 is not supported by CUDA!

unless compiling with -DKokkos_ARCH_AMPERE80=ON. It is not possible to target multiple GPU architectures while building Kokkos. So we should either target something recent, like AMPERE80, or build multiple targets by building different libs.

Installed packages

# packages in environment at /global/u2/v/vincentm/mambaforge/envs/cuda12:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
autograd                  1.5                pyhd8ed1ab_0    conda-forge
autoray                   0.3.1              pyhd8ed1ab_0    conda-forge
brotli                    1.0.9                h166bdaf_8    conda-forge
brotli-bin                1.0.9                h166bdaf_8    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.19.1               hd590300_0    conda-forge
ca-certificates           2023.5.7             hbcca054_0    conda-forge
cachetools                5.3.0              pyhd8ed1ab_0    conda-forge
certifi                   2023.5.7           pyhd8ed1ab_0    conda-forge
charset-normalizer        3.1.0              pyhd8ed1ab_0    conda-forge
cmake                     3.26.4               hcfe8598_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
coverage                  7.2.7           py310h2372a71_0    conda-forge
cuda-cccl_linux-64        12.0.90              ha770c72_1    conda-forge
cuda-cudart               12.0.107             h59595ed_5    conda-forge
cuda-cudart-dev           12.0.107             h59595ed_5    conda-forge
cuda-cudart-dev_linux-64  12.0.107             h59595ed_5    conda-forge
cuda-cudart-static        12.0.107             h59595ed_5    conda-forge
cuda-cudart-static_linux-64 12.0.107             h59595ed_5    conda-forge
cuda-cudart_linux-64      12.0.107             h59595ed_5    conda-forge
cuda-driver-dev           12.0.107             h59595ed_5    conda-forge
cuda-driver-dev_linux-64  12.0.107             h59595ed_5    conda-forge
cuda-version              12.0                 hffde075_2    conda-forge
cvxopt                    1.3.1           py310h14a12bf_0    conda-forge
cvxpy                     1.3.1           py310hff52083_1    conda-forge
cvxpy-base                1.3.1           py310h7cbd5c2_1    conda-forge
dsdp                      5.8               hd9d9efa_1203    conda-forge
ecos                      2.0.11          py310h0a54255_0    conda-forge
exceptiongroup            1.1.1              pyhd8ed1ab_0    conda-forge
expat                     2.5.0                hcb278e6_1    conda-forge
fftw                      3.3.10          nompi_hc118613_108    conda-forge
flaky                     3.7.0              pyh9f0ad1d_0    conda-forge
future                    0.18.3             pyhd8ed1ab_0    conda-forge
glpk                      5.0                  h445213a_0    conda-forge
gmp                       6.2.1                h58526e2_0    conda-forge
gsl                       2.7                  he838d99_0    conda-forge
icu                       72.1                 hcb278e6_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
iniconfig                 2.0.0              pyhd8ed1ab_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kokkos                    4.0.01               h66b1f04_1    conda-forge
kokkos-kernels            4.0.01               h00ab1b0_0    conda-forge
krb5                      1.20.1               h81ceb04_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libblas                   3.9.0           17_linux64_openblas    conda-forge
libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
libbrotlidec              1.0.9                h166bdaf_8    conda-forge
libbrotlienc              1.0.9                h166bdaf_8    conda-forge
libcblas                  3.9.0           17_linux64_openblas    conda-forge
libcurl                   8.1.2                h409715c_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.1.0               he5830b7_0    conda-forge
libgfortran-ng            13.1.0               h69a702a_0    conda-forge
libgfortran5              13.1.0               h15d22d2_0    conda-forge
libgomp                   13.1.0               he5830b7_0    conda-forge
libhwloc                  2.9.1           nocuda_h7313eea_6    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
liblapack                 3.9.0           17_linux64_openblas    conda-forge
libnghttp2                1.52.0               h61bc06f_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.23          pthreads_h80387f5_0    conda-forge
libosqp                   0.6.3                h59595ed_0    conda-forge
libqdldl                  0.1.5                h27087fc_1    conda-forge
libsqlite                 3.42.0               h2797004_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.1.0               hfd8a6a1_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libuv                     1.44.2               h166bdaf_0    conda-forge
libxml2                   2.11.4               h0d562d8_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
llvm-openmp               16.0.6               h4dfa4b3_0    conda-forge
metis                     5.1.0             h58526e2_1006    conda-forge
mpfr                      4.2.0                hb012696_0    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
networkx                  3.1                pyhd8ed1ab_0    conda-forge
ninja                     1.11.1               h924138e_0    conda-forge
numpy                     1.23.5          py310h53a5b5f_0    conda-forge
openssl                   3.1.1                hd590300_1    conda-forge
opt_einsum                3.3.0              pyhd8ed1ab_1    conda-forge
osqp                      0.6.3           py310h7cbd5c2_1    conda-forge
packaging                 23.1               pyhd8ed1ab_0    conda-forge
pennylane                 0.30.0          py310hff52083_5    conda-forge
pennylane-core            0.30.0          py310hff52083_5    conda-forge
pennylane-lightning-core  0.30.0          py310h68b9813_2    conda-forge
pennylane-lightning-kokkos 0.31.0.dev1               dev_0    <develop>
pip                       23.1.2             pyhd8ed1ab_0    conda-forge
platformdirs              3.6.0              pyhd8ed1ab_0    conda-forge
pluggy                    1.0.0              pyhd8ed1ab_5    conda-forge
pooch                     1.7.0              pyha770c72_3    conda-forge
pybind11                  2.10.4          py310hdf3cbec_0    conda-forge
pybind11-global           2.10.4          py310hdf3cbec_0    conda-forge
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
pytest                    7.3.2              pyhd8ed1ab_1    conda-forge
pytest-cov                4.1.0              pyhd8ed1ab_0    conda-forge
pytest-mock               3.11.1             pyhd8ed1ab_0    conda-forge
python                    3.10.0          h543edf9_3_cpython    conda-forge
python_abi                3.10                    3_cp310    conda-forge
qdldl-python              0.1.5.post2     py310h769672d_0    conda-forge
readline                  8.2                  h8228510_1    conda-forge
requests                  2.31.0             pyhd8ed1ab_0    conda-forge
rhash                     1.4.3                h166bdaf_0    conda-forge
rustworkx                 0.13.0          py310h47bb294_0    conda-forge
scipy                     1.10.1          py310ha4c1d20_3    conda-forge
scs                       3.2.3           py310heb8e4c9_0    conda-forge
semantic_version          2.10.0             pyhd8ed1ab_0    conda-forge
setuptools                67.7.2             pyhd8ed1ab_0    conda-forge
sqlite                    3.42.0               h2c6b66d_0    conda-forge
suitesparse               5.10.1               h9e50725_1    conda-forge
tbb                       2021.9.0             hf52228f_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
typing-extensions         4.6.3                hd8ed1ab_0    conda-forge
typing_extensions         4.6.3              pyha770c72_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
urllib3                   2.0.3              pyhd8ed1ab_0    conda-forge
wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.2                h3eb15da_6    conda-forge

Environment info

active environment : cuda12
    active env location : /global/u2/v/vincentm/mambaforge/envs/cuda12
            shell level : 2
       user config file : /global/homes/v/vincentm/.condarc
 populated config files : /global/u2/v/vincentm/mambaforge/.condarc
          conda version : 23.1.0
    conda-build version : not installed
         python version : 3.10.9.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.0=0
                          __glibc=2.31=0
                          __linux=5.14.21=0
                          __unix=0=0
       base environment : /global/u2/v/vincentm/mambaforge  (writable)
      conda av data dir : /global/u2/v/vincentm/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /global/u2/v/vincentm/mambaforge/pkgs
                          /global/homes/v/vincentm/.conda/pkgs
       envs directories : /global/u2/v/vincentm/mambaforge/envs
                          /global/homes/v/vincentm/.conda/envs
               platform : linux-64
             user-agent : conda/23.1.0 requests/2.28.2 CPython/3.10.9 Linux/5.14.21-150400.24.46_12.0.73-cray_shasta_c sles/15.4 glibc/2.31
                UID:GID : 100818:100818
             netrc file : None
           offline mode : False

vincentmr commented 1 year ago

Hi @carterbox, I could help fix the paths, but I'm curious whether you think there is a way forward regarding the GPU architectures. I'm thinking of using a multiple-output recipe, one for each arch. Would that work?

carterbox commented 1 year ago

Yes, it would be helpful if you were to work on replacing any build prefix paths with the appropriate paths.

Issue

I tried installing Lightning-Kokkos (L-Kokkos) on top of Kokkos (with the CUDA backend). I'm using Perlmutter, which is a GPU cluster of NERSC. Using CUDA-12 (not officially supported by L-Kokkos), I met the following issues

  • The compiler x86_64-conda-linux-gnu-c++ is set as the Kokkos_CXX_COMPILER. On Perlmutter, the standard C++ compiler is CC. One must modify lib/cmake/Kokkos/KokkosConfigCommon.cmake and bin/kokkos_launch_compiler accordingly.

x86_64-conda-linux-gnu-c++ is the name of the conda-forge provided compiler for x86, which you can install via the gxx_linux-64 package. The cxx-compiler package is a metapackage that matches the platform of the environment. I only intended for downstream users to use the conda-forge provided compilers and this package to compile more conda-forge packages, not to provide Kokkos as a build tool for general use.

We could probably use $ENV{CXX} instead for greater compatibility? AFAIK, $CC is usually the environment variable for the C compiler, not the C++ compiler.
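A minimal sketch of what the $ENV{CXX} idea could look like in an installed config file. Kokkos_CXX_COMPILER is the variable named in this thread; the fallback logic and the placement are assumptions, not what the feedstock ships.

```cmake
# Sketch only: pick up the caller's compiler from the CXX environment variable
# and fall back to the conda-forge compiler name when CXX is not set.
if(DEFINED ENV{CXX})
  # For example, CXX=CC when using Perlmutter's compiler wrappers.
  set(Kokkos_CXX_COMPILER "$ENV{CXX}")
else()
  set(Kokkos_CXX_COMPILER "x86_64-conda-linux-gnu-c++")
endif()
message(STATUS "Kokkos_CXX_COMPILER = ${Kokkos_CXX_COMPILER}")
```

Saved to a file, the fragment can be tried in script mode with cmake -P to see which value is selected for a given environment.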

  • The targets CUDA::cudart, CUDA::cuda_driver have unpatched INTERFACE_INCLUDE_DIRECTORIES still pointing to /home/conda/feedstock_root/build_artifacts/kokkos_1687386826792/_build_env/targets/x86_64-linux/include.

In which file is this path? Maybe this is something that needs to be addressed upstream? CMake already has modules to find these libraries, so I'm not sure why they need to hardcode the locations in this package.

Installing Kokkos from source to go through, trying to execute on Perlmutter's A100 GPUs, I get errors like

Kokkos::Cuda::initialize ERROR: running kernels compiled for compute capability 7.0 on device with compute capability 8.0 is not supported by CUDA!

unless compiling with -DKokkos_ARCH_AMPERE80=ON. It is not possible to target multiple GPU architectures while building Kokkos. So we should either target something recent, like AMPERE80, or build multiple targets by building different libs.

The reason that I decided it was OK to ship Kokkos with CUDA enabled is that I realized that we can set the compile options to include PTX with the kokkos shared objects. Thus by compiling for the lowest compute capability (35), the libraries should be compatible with any later devices via JIT compilation by the CUDA driver. This is not best for performance, but since Conda doesn't track CUDA archs, we must build for compatibility. Also, Kokkos doesn't allow targeting more than one CUDA arch because of compile time optimizations.

I'm not sure what options you used, but they are not the same because the conda-forge package is compiled for 35 or 50 depending on CUDA version. I suspect that by default Kokkos does not include PTX.
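To illustrate the PTX point in plain CMake terms (this is generic CMake for a standalone CUDA target, not Kokkos's own Kokkos_ARCH_* options and not the feedstock's actual flags): an architecture entry without a suffix asks for both device code and PTX, and the PTX is what allows the driver to JIT-compile for newer devices. The target name and source file below are hypothetical.

```cmake
cmake_minimum_required(VERSION 3.18)
project(ptx_example LANGUAGES CXX CUDA)

# kernels.cu is a placeholder source file for this sketch.
add_library(kernels SHARED kernels.cu)

# "50"         -> device code for sm_50 plus compute_50 PTX (JIT-able on later GPUs)
# "50-real"    -> device code for sm_50 only, no PTX
# "50-virtual" -> compute_50 PTX only
set_target_properties(kernels PROPERTIES CUDA_ARCHITECTURES "50")
```

With the "-real" variant, a library built for a lower compute capability would not load on a newer device, which matches the kind of failure reported above for a 7.0 build on an 8.0 device.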

I'm thinking of using a multiple-output recipe, one for each arch. Would that work?

No. Conda doesn't track CUDA archs.

carterbox commented 1 year ago

In which file is this path? Maybe this is something that needs to be addressed upstream? CMake already has modules to find these libraries, so I'm not sure why they need to hardcode the locations in this package.

I just checked the cmake files in $PREFIX/lib/cmake/Kokkos for kokkos 4.0.01 h1e7fabd_1. They do not mention the build_artifacts prefix anywhere.

Also, looking again at the KokkosConfig.cmake file, it seems that if you call find_package(CUDAToolkit) in your downstream project (and it succeeds), the following blocks are skipped:

    IF(NOT TARGET CUDA::cudart)
      ADD_LIBRARY(CUDA::cudart UNKNOWN IMPORTED)
      SET_TARGET_PROPERTIES(CUDA::cudart PROPERTIES
        IMPORTED_LOCATION "/usr/local/cuda/lib64/libcudart.so"
        INTERFACE_INCLUDE_DIRECTORIES "/usr/local/cuda/include"
      )
    ENDIF()
    IF(NOT TARGET CUDA::cuda_driver)
      ADD_LIBRARY(CUDA::cuda_driver UNKNOWN IMPORTED)
      SET_TARGET_PROPERTIES(CUDA::cuda_driver PROPERTIES
        IMPORTED_LOCATION "/usr/lib64/libcuda.so"
        INTERFACE_INCLUDE_DIRECTORIES "/usr/local/cuda/include"
      )
    ENDIF()

So I retract my suggestion that something needs to be addressed upstream because the Kokkos developers have given us a way in which to avoid these hardcoded paths.

vincentmr commented 1 year ago

Yes, it would be helpful if you were to work on replacing any build prefix paths with the appropriate paths.

Issue

I tried installing Lightning-Kokkos (L-Kokkos) on top of Kokkos (with the CUDA backend). I'm using Perlmutter, which is a GPU cluster of NERSC. Using CUDA-12 (not officially supported by L-Kokkos), I met the following issues

  • The compiler x86_64-conda-linux-gnu-c++ is set as the Kokkos_CXX_COMPILER. On Perlmutter, the standard C++ compiler is CC. One must modify lib/cmake/Kokkos/KokkosConfigCommon.cmake and bin/kokkos_launch_compiler accordingly.

x86_64-conda-linux-gnu-c++ is the name of the conda-forge provided compiler for x86, which you can install via the gxx_linux-64 package. The cxx-compiler package is a metapackage that matches the platform of the environment. I only intended for downstream users to use the conda-forge provided compilers and this package to compile more conda-forge packages, not to provide Kokkos as a build tool for general use.

We could probably use $ENV{CXX} instead for greater compatibility? AFAIK, $CC is usually the environment variable for the C compiler, not the C++ compiler.

Got it. Let me try this out. Also, FYI, Perlmutter's compiler wrappers are awkwardly called ftn, cc and CC for Fortran, C and C++ respectively. So one has to type a lot of commands like CC=cc CXX=CC cmake -B build.

  • The targets CUDA::cudart, CUDA::cuda_driver have unpatched INTERFACE_INCLUDE_DIRECTORIES still pointing to /home/conda/feedstock_root/build_artifacts/kokkos_1687386826792/_build_env/targets/x86_64-linux/include.

In which file is this path? Maybe this is something that needs to be addressed upstream? CMake already has modules to find these libraries, so I'm not sure why they need to hardcode the locations in this package.

This is in cuda12/lib/cmake/Kokkos/KokkosConfig.cmake.

Installing Kokkos from source to go through, trying to execute on Perlmutter's A100 GPUs, I get errors like

Kokkos::Cuda::initialize ERROR: running kernels compiled for compute capability 7.0 on device with compute capability 8.0 is not supported by CUDA!

unless compiling with -DKokkos_ARCH_AMPERE80=ON. It is not possible to target multiple GPU architectures while building Kokkos. So we should either target something recent, like AMPERE80, or build multiple targets by building different libs.

The reason that I decided it was OK to ship Kokkos with CUDA enabled is that I realized that we can set the compile options to include PTX with the kokkos shared objects. Thus by compiling for the lowest compute capability (35), the libraries should be compatible with any later devices via JIT compilation by the CUDA driver. This is not best for performance, but since Conda doesn't track CUDA archs, we must build for compatibility. Also, Kokkos doesn't allow targeting more than one CUDA arch because of compile time optimizations.

I'll check whether we can do something with PTX.

carterbox commented 1 year ago

This is in cuda12/lib/cmake/Kokkos/KokkosConfig.cmake.

Oh! These unpatched build prefixes are only in the CUDA 12 package. We should probably still patch them or replace this target with an error telling the user to use find_package(CUDA_Toolkit).

carterbox commented 1 year ago

I believe that PTX forward compatibility may only be available for Kokkos 4, so I will probably have to pull the CUDA builds for 3.x.

https://github.com/kokkos/kokkos/issues/5439
https://github.com/kokkos/kokkos/issues/3612

vincentmr commented 1 year ago

I think that's right. I found this PR, which removes the blocking condition. I don't know enough about CUDA and Kokkos to know whether Kokkos really supports this forward compatibility in the end.

vincentmr commented 1 year ago

With CUDA-12, trying find_package(CUDA_Toolkit), I get the following issue:

    CMake Warning at CMakeLists.txt:109 (find_package):
      By not providing "FindCUDA_Toolkit.cmake" in CMAKE_MODULE_PATH this project
      has asked CMake to find a package configuration file provided by
      "CUDA_Toolkit", but CMake did not find one.

      Could not find a package configuration file provided by "CUDA_Toolkit" with
      any of the following names:

        CUDA_ToolkitConfig.cmake
        cuda_toolkit-config.cmake

which can be resolved with cp libcudacxx-config.cmake CUDA_ToolkitConfig.cmake. Should we create a symlink so that the file has the name expected by CMake? Then

    -- Found libcudacxx: /global/homes/v/vincentm/mambaforge/envs/cuda12/targets/x86_64-linux/lib/cmake/libcudacxx/CUDA_ToolkitConfig.cmake
    -- Found existing Kokkos libraries
    -- pybind11 v2.10.1
    -- Configuring done (0.7s)
    CMake Error in pennylane_lightning_kokkos/src/simulator/CMakeLists.txt:
      Imported target "Kokkos::kokkos" includes non-existent path

        "/home/conda/feedstock_root/build_artifacts/kokkos_1687972725614/_build_env/targets/x86_64-linux/include"

      in its INTERFACE_INCLUDE_DIRECTORIES.

which is the problem described above. I have the following hits

# grep -r kokkos_1687972725614 .
./lib/cmake/Kokkos/KokkosConfig.cmake:INTERFACE_INCLUDE_DIRECTORIES "/home/conda/feedstock_root/build_artifacts/kokkos_1687972725614/_build_env/targets/x86_64-linux/include"
./lib/cmake/Kokkos/KokkosConfig.cmake:INTERFACE_INCLUDE_DIRECTORIES "/home/conda/feedstock_root/build_artifacts/kokkos_1687972725614/_build_env/targets/x86_64-linux/include"

After fixing them it compiles. Since we are already patching the lib paths, I think we should patch the include paths as well.
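A downstream project can surface this kind of stale path at configure time instead of at generate time. The following is a rough sketch, assuming it is placed after find_package(Kokkos); the target names are the ones mentioned in this thread, and the loop variable names are arbitrary.

```cmake
# Sketch only: warn about imported CUDA targets whose include directories
# do not exist on this machine (for example, a leftover build prefix).
foreach(tgt CUDA::cudart CUDA::cuda_driver)
  if(TARGET ${tgt})
    get_target_property(inc_dirs ${tgt} INTERFACE_INCLUDE_DIRECTORIES)
    # inc_dirs is "inc_dirs-NOTFOUND" (false) when the property is unset.
    if(inc_dirs)
      foreach(dir IN LISTS inc_dirs)
        # Skip generator expressions, which cannot be checked at configure time.
        if(NOT dir MATCHES "\\$<" AND NOT EXISTS "${dir}")
          message(WARNING "${tgt} lists a missing include directory: ${dir}")
        endif()
      endforeach()
    endif()
  endif()
endforeach()
```

This only reports the problem; it does not rewrite the installed KokkosConfig.cmake.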

carterbox commented 1 year ago

The package name is CUDAToolkit, not CUDA_Toolkit, and it requires CMake 3.17 or later.
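For reference, a minimal downstream CMakeLists.txt along these lines might look as follows. The project name, target name, and main.cpp are placeholders; Kokkos::kokkos is the imported target that appears in the error output earlier in this thread.

```cmake
cmake_minimum_required(VERSION 3.17)
project(downstream_example LANGUAGES CXX)

# Find the CUDA toolkit first so that CUDA::cudart and CUDA::cuda_driver
# already exist when the Kokkos config file is loaded.
find_package(CUDAToolkit REQUIRED)
find_package(Kokkos REQUIRED)

# main.cpp is a hypothetical source file for this sketch.
add_executable(example main.cpp)
target_link_libraries(example PRIVATE Kokkos::kokkos)
```

Because the CUDA targets are defined before Kokkos is found, the IF(NOT TARGET ...) blocks in KokkosConfig.cmake that carry the hardcoded paths should be skipped.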

vincentmr commented 1 year ago

Thanks @carterbox for pointing this out. I'm using

> cmake --version
cmake version 3.26.4

CMake finds CUDAToolkit now. So one needs to insert find_package(CUDAToolkit) before find_package(Kokkos). Should we then remove the sed patch altogether, or still patch the includes too?

I also get the following runtime error

E   RuntimeError: Kokkos::Impl::ParallelReduce< Cuda > requested too much L0 scratch memory

when trying to do anything. This is unrelated, but I was wondering whether you could successfully run some test?

carterbox commented 1 year ago

when trying to do anything. This is unrelated, but I was wondering whether you could successfully run some test?

I tried building and running the project in the examples (build_cmake_installed). First, I created the following environment:

mamba create -n kokkos cxx-compiler cuda-compiler fortran-compiler cmake ninja kokkos=4

Then I edited the example to call find_package(CUDAToolkit) before finding Kokkos.

Finally, I configured and built the project using cmake and ninja.

Running the example works.

I did the same with the CUDA 11.2 build of Kokkos 4 by creating the following environment:

mamba create -n kokkos2 cxx-compiler gxx=10 kokkos=4 cuda-version=11.2 cmake ninja fortran-compiler

Also works. I do get warning messages about performance because the conda-forge packages were built for SM35 and SM50, but my device is SM61.

Should we then remove the sed patch altogether, or still patch the includes too?

I think we should replace the hardcoded paths with an error telling the user to call find_package(CUDAToolkit). Something like:

    IF(NOT TARGET CUDA::cudart)
      MESSAGE(FATAL_ERROR "The CUDA::cudart target was not found; use find_package(CUDAToolkit REQUIRED) before find_package(Kokkos).")
    ENDIF()
    IF(NOT TARGET CUDA::cuda_driver)
      MESSAGE(FATAL_ERROR "The CUDA::cuda_driver target was not found; use find_package(CUDAToolkit REQUIRED) before find_package(Kokkos).")
    ENDIF()

It's probably easier than trying to guess about the end user's conda environment.

vincentmr commented 1 year ago

Running the example works.

Good. I guess there is an issue with my examples (too large maybe?).

I think we should replace the hardcoded paths with an error telling the user to call find_package(CUDAToolkit). Something like:

I like that solution. I think we can close this and implement your fix.