bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

cuda module build failure - cuda-12.3.1, cudnn-8.9.7.29, nccl-2.19.3, nvcomp-3.0.5 #1454

Closed archenroot closed 8 months ago

archenroot commented 8 months ago

@saudet - hi buddy,

I saw your latest commit, which upgrades to the latest NVIDIA CUDA stack, so I downloaded the following files:

 zangetsu  X10SRA  ~  ls devel/sdk/cuda-12
drwxrwxr-x 2 zangetsu zangetsu 4,0K 2023-12-28 20:26 .
drwxrwxr-x 5 zangetsu zangetsu 4,0K 2023-12-28 20:45 ..
-rw-rw-r-- 1 zangetsu zangetsu 4,3K 2022-04-22 11:14 cuda-keyring_1.0-1_all.deb
-rw-rw-r-- 1 zangetsu zangetsu 3,1G 2023-11-09 07:53 cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
-rw-rw-r-- 1 zangetsu zangetsu 846M 2023-12-28 19:34 cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
-rw-rw-r-- 1 zangetsu zangetsu 192M 2023-12-28 19:56 nccl-local-repo-ubuntu2204-2.19.3-cuda12.3_1.0-1_amd64.deb
-rw-rw-r-- 1 zangetsu zangetsu  22M 2023-12-04 19:28 nvcomp_3.0.5_x86_64_12.x.tgz

I installed them using the following procedure, taken from the NVIDIA install guides:

sudo apt-get -y install cuda-toolkit-12-3
sudo apt install  libcudnn8-dev=8.9.7.29-1+cuda12.2 libcudnn8-samples=8.9.7.29-1+cuda12.2
sudo apt install libnccl2=2.19.3-1+cuda12.3 libnccl-dev=2.19.3-1+cuda12.3
sudo tar -xvf nvcomp_*.tgz -C /usr/local/cuda/include/ --strip-components=1 include/
sudo tar -xvf nvcomp_*.tgz -C /usr/local/cuda/lib64/ --strip-components=1 lib/
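
A quick way to confirm that the two tar commands above put files where a build would look for them is to test for a few paths under the CUDA root. This is only a minimal sketch: `check_installed` is a hypothetical helper, and the nvCOMP file names in the example call are assumptions about the archive contents, not something this thread confirms.

```shell
# Report which of the given relative paths exist under a root directory.
# Returns 0 if all of them exist, 1 if at least one is missing.
check_installed() {
    local root="$1"
    shift
    local missing=0
    local rel
    for rel in "$@"; do
        if [ -e "$root/$rel" ]; then
            echo "ok       $root/$rel"
        else
            echo "MISSING  $root/$rel"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (the file names are assumptions, adjust to the actual archive):
# check_installed /usr/local/cuda include/nvcomp.h lib64/libnvcomp.so
```

Keeping the check as a function that takes the root directory makes it easy to run the same list against `/usr/local/cuda` and against `/usr`, since the packages above do not all install into the same prefix.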

When I query the installed stack, I get the following for CUDA:

 zangetsu  X10SRA  …/public/javacpp-presets  master ◔  sudo apt search cuda |grep installed

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cuda-cccl-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-command-line-tools-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
cuda-compiler-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
cuda-crt-12-3/unknown,now 12.3.103-1 amd64 [installed,automatic]
cuda-cudart-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-cudart-dev-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-cuobjdump-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-cupti-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-cupti-dev-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-cuxxfilt-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-documentation-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-driver-dev-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-drivers/unknown,now 545.23.08-1 amd64 [installed]
cuda-drivers-545/unknown,now 545.23.08-1 amd64 [installed,automatic]
cuda-gdb-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-keyring/unknown,now 1.1-1 all [installed]
cuda-libraries-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
cuda-libraries-dev-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
cuda-nsight-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-nsight-compute-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
cuda-nsight-systems-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
cuda-nvcc-12-3/unknown,now 12.3.103-1 amd64 [installed,automatic]
cuda-nvdisasm-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-nvml-dev-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-nvprof-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-nvprune-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-nvrtc-12-3/unknown,now 12.3.103-1 amd64 [installed,automatic]
cuda-nvrtc-dev-12-3/unknown,now 12.3.103-1 amd64 [installed,automatic]
cuda-nvtx-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-nvvm-12-3/unknown,now 12.3.103-1 amd64 [installed,automatic]
cuda-nvvp-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-opencl-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-opencl-dev-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-profiler-api-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-repo-ubuntu2204-12-3-local/now 12.3.1-545.23.08-1 amd64 [installed,local]
cuda-sanitizer-12-3/unknown,now 12.3.101-1 amd64 [installed,automatic]
cuda-toolkit-12-3/unknown,now 12.3.1-1 amd64 [installed]
cuda-toolkit-12-3-config-common/unknown,now 12.3.101-1 all [installed,automatic]
cuda-toolkit-12-config-common/unknown,now 12.3.101-1 all [installed,automatic]
cuda-toolkit-config-common/unknown,now 12.3.101-1 all [installed,automatic]
cuda-tools-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
cuda-visual-tools-12-3/unknown,now 12.3.1-1 amd64 [installed,automatic]
libcufile-12-3/unknown,now 1.8.1.2-1 amd64 [installed,automatic]
libcusolver-12-3/unknown,now 11.5.4.101-1 amd64 [installed,automatic]
libcusolver-dev-12-3/unknown,now 11.5.4.101-1 amd64 [installed,automatic]
libnvidia-compute-545/unknown,now 545.23.08-0ubuntu1 amd64 [installed,automatic]
libnvidia-decode-545/unknown,now 545.23.08-0ubuntu1 amd64 [installed,automatic]
nsight-compute-2023.3.1/unknown,now 2023.3.1.1-1 amd64 [installed,automatic]
nvidia-compute-utils-545/unknown,now 545.23.08-0ubuntu1 amd64 [installed,automatic]

And for cuDNN:

zangetsu  X10SRA  …/public/javacpp-presets  master ◔  sudo apt search cudnn |grep installed

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cudnn-local-repo-ubuntu2204-8.9.7.29/now 1.0-1 amd64 [installed,local]
libcudnn8/unknown,now 8.9.7.29-1+cuda12.2 amd64 [installed]
libcudnn8-dev/unknown,now 8.9.7.29-1+cuda12.2 amd64 [installed]
libcudnn8-samples/unknown,now 8.9.7.29-1+cuda12.2 amd64 [installed]

I am on Ubuntu 22.04.3 LTS. My nvcc:

 zangetsu  X10SRA  ~  nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

cuDNN:

zangetsu  X10SRA  ~  cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 7
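
The same three defines can be collapsed into one version string, which is easier to compare against the version named in a package or a commit. A minimal sketch, assuming the header uses plain `#define NAME VALUE` lines as shown above; `cudnn_version` is a hypothetical helper name.

```shell
# Print "MAJOR.MINOR.PATCHLEVEL" read from a cuDNN version header.
# Defaults to the header path used above; pass another path as the first argument.
cudnn_version() {
    local header="${1:-/usr/include/cudnn_version.h}"
    local major minor patch
    # Match only exact "#define NAME VALUE" lines so derived macros are ignored.
    major=$(awk '$1 == "#define" && $2 == "CUDNN_MAJOR" { print $3 }' "$header")
    minor=$(awk '$1 == "#define" && $2 == "CUDNN_MINOR" { print $3 }' "$header")
    patch=$(awk '$1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { print $3 }' "$header")
    echo "${major}.${minor}.${patch}"
}
```

For a header containing exactly the three lines shown above, `cudnn_version` prints `8.9.7`.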

All of the above seems to match your commit https://github.com/bytedeco/javacpp-presets/commit/a885e6c58268eee4510525c9597c5b90a169d7a2

But my local mvn install fails. Log: javacpp-presets-cuda-12.3-failure.txt

Since your commit passed CI/CD, it seems my setup must differ significantly from your GitHub Actions configuration.

saudet commented 8 months ago

If it works with GitHub Actions, that's alright

archenroot commented 8 months ago

Ok, I will try to look at GitHub actions configuration to replicate it locally.

archenroot commented 8 months ago

@saudet - OK, I looked into the GitHub Actions script, extracted only the relevant parts, and am now testing whether it works:


export SUDO=$(which sudo)

mkdir -p .ccache
echo "max_size = 2.0G"                                                                        > .ccache/ccache.conf
echo "hash_dir = false"                                                                      >> .ccache/ccache.conf
echo "sloppiness = file_macro,include_file_ctime,include_file_mtime,pch_defines,time_macros" >> .ccache/ccache.conf

export ARCH=amd64
export PREFIX=x86_64-linux-gnu
export ARCH_CUDA=x86_64
export CUDA=cuda-repo-rhel8-12-3-local-12.3.1_545.23.08-1.x86_64.rpm
export CUDNN=8.9.7.29-1.cuda12.2.x86_64
export NCCL=2.19.3-1+cuda12.3.x86_64
export NVCOMP=nvcomp_3.0.5_x86_64_12.x

$SUDO dpkg --list
$SUDO apt-get update
$SUDO apt-get -y install gnupg
source /etc/os-release

export CODENAME=$UBUNTU_CODENAME
if [[ ! "$ARCH" == "amd64" ]]; then
    # https://github.com/actions/runner-images/issues/675
    $SUDO gem install apt-spy2
    $SUDO apt-spy2 check
    $SUDO apt-spy2 fix --commit
    $SUDO sed -i 's/azure\.//' /etc/apt/apt-mirrors.txt /etc/apt/sources.list
    $SUDO cat /etc/apt/apt-mirrors.txt /etc/apt/sources.list
    $SUDO apt-get update

    # https://github.com/actions/runner-images/issues/4589
    $SUDO apt-add-repository -y ppa:ondrej/php
    $SUDO apt-get -y install ppa-purge
    $SUDO ppa-purge -y ppa:ondrej/php
fi

if [[ "$ARCH" == "i386" ]]; then
    $SUDO dpkg --add-architecture $ARCH
    TOOLCHAIN="gcc-$PREFIX g++-$PREFIX gfortran-$PREFIX"
elif [[ ! "$ARCH" == "amd64" ]]; then
    $SUDO dpkg --add-architecture $ARCH
    $SUDO sed -i 's/deb http/deb [arch=amd64] http/g' /etc/apt/sources.list
    $SUDO echo deb [arch=$ARCH] http://ports.ubuntu.com/ubuntu-ports $CODENAME main restricted universe multiverse | $SUDO tee -a /etc/apt/sources.list
    $SUDO echo deb [arch=$ARCH] http://ports.ubuntu.com/ubuntu-ports $CODENAME-updates main restricted universe multiverse | $SUDO tee -a /etc/apt/sources.list
    $SUDO echo deb [arch=$ARCH] http://ports.ubuntu.com/ubuntu-ports $CODENAME-backports main restricted universe multiverse | $SUDO tee -a /etc/apt/sources.list
    $SUDO echo deb [arch=$ARCH] http://ports.ubuntu.com/ubuntu-ports $CODENAME-security main restricted universe multiverse | $SUDO tee -a /etc/apt/sources.list
    TOOLCHAIN="gcc-$PREFIX g++-$PREFIX gfortran-$PREFIX linux-libc-dev-$ARCH-cross binutils-multiarch"
fi

$SUDO apt-get update
$SUDO apt-get -y install gcc-multilib g++-multilib gfortran-multilib python3 python2.7 python3-minimal python2.7-minimal rpm libasound2-dev:$ARCH freeglut3-dev:$ARCH libfontconfig-dev:$ARCH libgtk2.0-dev:$ARCH libusb-dev:$ARCH libusb-1.0-0-dev:$ARCH libffi-dev:$ARCH libbz2-dev:$ARCH zlib1g-dev:$ARCH libxcb1-dev:$ARCH
$SUDO apt-get -y install pkg-config ccache clang $TOOLCHAIN openjdk-8-jdk ant python2 python3-pip swig git file wget unzip tar bzip2 gzip patch autoconf-archive autogen automake cmake make libtool bison flex perl nasm ragel curl libcurl4-openssl-dev libssl-dev libffi-dev libbz2-dev zlib1g-dev rapidjson-dev
$SUDO python3 -m pip install gdown || $SUDO python3 -m pip install gdown

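# GITHUB_ENV is only set on GitHub Actions runners; fall back to /dev/null locally
echo "ARCH=$ARCH" >> "${GITHUB_ENV:-/dev/null}"
echo "PREFIX=$PREFIX" >> "${GITHUB_ENV:-/dev/null}"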

echo Installing CUDA, cuDNN, nvCOMP, etc
curl -LO https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/$CUDA
curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libcudnn8-$CUDNN.rpm
curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libcudnn8-devel-$CUDNN.rpm
curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libnccl-$NCCL.rpm
curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libnccl-devel-$NCCL.rpm

$SUDO rpm -i --force --ignorearch --nodeps $CUDA libcudnn*.rpm libnccl*.rpm
rm -f *.rpm *.tgz *.txz *.tar.*
pushd /var/cuda-repo-rhel8-12-3-local/; $SUDO rpm -i --force --ignorearch --nodeps cuda*.rpm libc*.rpm libn*.rpm; $SUDO rm *.rpm; popd
$SUDO ln -sf /usr/local/cuda/lib64/ /usr/local/cuda/lib
$SUDO ln -sf /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/libcuda.so
$SUDO ln -sf /usr/local/cuda/lib64/stubs/libnvidia-ml.so /usr/local/cuda/lib64/libnvidia-ml.so
$SUDO mv /usr/include/cudnn* /usr/include/nccl* /usr/local/cuda/include/
$SUDO mv /usr/lib64/libcudnn* /usr/lib64/libnccl* /usr/local/cuda/lib64/

if [[ -n ${NVCOMP:-} ]]; then
    echo "installing nvcomp"
    curl -LO https://developer.download.nvidia.com/compute/nvcomp/3.0.5/local_installers/$NVCOMP.tgz
    $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/lib64/ --strip-components=1 lib/ || $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/lib64/ --strip-components=2 nvcomp-3.0.5-ctk-12.2/lib/
    $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/include/ --strip-components=1 include/ || $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/include/ --strip-components=2 nvcomp-3.0.5-ctk-12.2/include/
    rm -f $NVCOMP.tgz
fi

archenroot commented 8 months ago

@saudet - the only strange thing is that you use RPM packages (CentOS, Fedora, RHEL) on Ubuntu, where we have Debian DEB packages. Wouldn't you want to rebuild this flow on the latest Ubuntu LTS image, i.e. 22.04 LTS, and install the NVIDIA DEB packages?

saudet commented 8 months ago

They only have RHEL packages for ppc64le...

archenroot commented 8 months ago

I looked at deploy-ubuntu/action.yml and see this:

name: Deploy on Ubuntu
runs:
  using: composite
...
...
        if [[ -n ${ARCH_CUDA:-} ]] && [[ -n ${CI_DEPLOY_NEED_CUDA:-} ]]; then
          echo Installing CUDA, cuDNN, nvCOMP, etc
          curl -LO https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/$CUDA
          curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libcudnn8-$CUDNN.rpm
          curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libcudnn8-devel-$CUDNN.rpm
          curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libnccl-$NCCL.rpm
          curl -LO https://developer.download.nvidia.com/compute/cuda/repos/rhel8/$ARCH_CUDA/libnccl-devel-$NCCL.rpm

          $SUDO rpm -i --force --ignorearch --nodeps $CUDA libcudnn*.rpm libnccl*.rpm
          rm -f *.rpm *.tgz *.txz *.tar.*
          pushd /var/cuda-repo-rhel8-12-3-local/; $SUDO rpm -i --force --ignorearch --nodeps cuda*.rpm libc*.rpm libn*.rpm; $SUDO rm *.rpm; popd
          $SUDO ln -sf /usr/local/cuda/lib64/ /usr/local/cuda/lib
          $SUDO ln -sf /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/libcuda.so
          $SUDO ln -sf /usr/local/cuda/lib64/stubs/libnvidia-ml.so /usr/local/cuda/lib64/libnvidia-ml.so
          $SUDO mv /usr/include/cudnn* /usr/include/nccl* /usr/local/cuda/include/
          $SUDO mv /usr/lib64/libcudnn* /usr/lib64/libnccl* /usr/local/cuda/lib64/

          if [[ -n ${NVCOMP:-} ]]; then
            curl -LO https://developer.download.nvidia.com/compute/nvcomp/3.0.5/local_installers/$NVCOMP.tgz
            $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/lib64/ --strip-components=1 lib/ || $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/lib64/ --strip-components=2 nvcomp-3.0.5-ctk-12.2/lib/
            $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/include/ --strip-components=1 include/ || $SUDO tar -xvf $NVCOMP.tgz -C /usr/local/cuda/include/ --strip-components=2 nvcomp-3.0.5-ctk-12.2/include/
            rm -f $NVCOMP.tgz
          fi

So the CUDA build uses RPM packages on Ubuntu... am I missing something in the GitHub Action?

archenroot commented 8 months ago

I cannot see either local or network NVIDIA DEB packages being installed in that Ubuntu deployment shell script... thanks for helping me understand.

saudet commented 8 months ago

If you can find Ubuntu packages for ppc64le, sure

archenroot commented 8 months ago

Aha, those don't exist... thanks for the hint.
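
One way to check for yourself which distro and architecture combinations NVIDIA publishes is to probe the repository directories directly. The sketch below reuses the URL layout from the curl commands in the CI script; `cuda_repo_url` and `repo_exists` are hypothetical helper names, and the distro and architecture names other than `rhel8` are assumptions. It also assumes the server answers HEAD requests for directory URLs.

```shell
# Build the NVIDIA CUDA repository URL for a distro/arch pair.
# The layout follows the curl commands in the CI script above.
cuda_repo_url() {
    local distro="$1"
    local arch="$2"
    echo "https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/"
}

# Return 0 if the repository directory answers an HTTP HEAD request successfully.
repo_exists() {
    curl --fail --silent --head --location --output /dev/null \
        "$(cuda_repo_url "$1" "$2")"
}

# Example (needs network access; names other than rhel8 are assumptions):
# for pair in "rhel8 ppc64le" "ubuntu2204 ppc64el" "ubuntu2204 x86_64"; do
#     set -- $pair
#     if repo_exists "$1" "$2"; then echo "$pair: found"; else echo "$pair: not found"; fi
# done
```

A directory that exists may still lack a particular package version, so a positive result here only narrows the search.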