BaguaSys / bagua

Bagua Speeds up PyTorch
https://tutorials-8ro.pages.dev/
MIT License

Aluminium compilation error #561

Closed mmathys closed 2 years ago

mmathys commented 2 years ago

Describe the bug

I get a compilation error related to NCCL in the Aluminum third-party module. What is a possible fix?

I installed the dependencies according to the tutorial (see exact commands below). I also initialized the Bagua core module (not in the documentation, advised by Yafen).

Environment

Reproducing

Starting point: Ubuntu 20.04 with CUDA 11.3 (as reported by nvcc --version)

Run setup script:

#!/bin/bash
# bash is required: the script uses [[ ]], extended glob patterns and &>
cd /home/ubuntu
source .bashrc

set -eux

exit_and_error() {
    echo "Auto installation is supported only on Ubuntu (18.04, 20.04) or CentOS (7, 8), abort."
    exit 1
}

check_os_version() {
    OS_NAME=$(grep ^NAME /etc/os-release | awk -F'"' '{print $2}')
    VERSION_ID=$(grep ^VERSION_ID /etc/os-release | awk -F'"' '{print $2}')
    echo "Current OS is ${OS_NAME}, Version is ${VERSION_ID}"
    if [ "$OS_NAME" == "Ubuntu" ]; then
        if [[ $VERSION_ID != @("18.04"|"20.04") ]]; then
            exit_and_error
        fi
    elif [ "$OS_NAME" == "CentOS Linux" ]; then
        if [[ $VERSION_ID != @("7"|"8") ]]; then
            exit_and_error
        fi
    else
        exit_and_error
    fi
}

# upgrade to python3.8
confirm() {
    # call with a prompt string or use a default;
    # returns 0 on a yes answer and 1 on anything else
    echo "Your Python version is $(python3 -V), but Bagua requires Python version >= 3.7."
    read -r -p "${1:-Do you want to upgrade Python? [y/N]} " response
    case "$response" in
    [yY][eE][sS] | [yY])
        return 0
        ;;
    *)
        return 1
        ;;
    esac
}

upgrade_python() {
    if [ "$OS_NAME" == "Ubuntu" ]; then
        sudo apt-get install -y python3.8 python3.8-distutils python3.8-dev
    elif [ "$OS_NAME" == "CentOS Linux" ]; then
        # download to /var/tmp and unpack there, so the cd below finds the sources
        mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://www.python.org/ftp/python/3.8.12/Python-3.8.12.tgz &&
            tar xvf /var/tmp/Python-3.8.12.tgz -C /var/tmp &&
            cd /var/tmp/Python-3.8.12 && ./configure --enable-optimizations --prefix=/usr && make altinstall &&
            rm -rf /var/tmp/Python-3.8.12.tgz /var/tmp/Python-3.8.12 && cd -
    fi
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
}

check_python_version() {
    PYTHON_VERSION_OK=$(python3 -c 'import sys; print(int(sys.version_info >= (3, 7)))')
    if [[ $PYTHON_VERSION_OK -eq 0 ]]; then
        # explicit if: a "no" answer must not abort the script under set -e
        if confirm; then
            upgrade_python
        fi
    fi
}

check_os_version

# install necessary packages
if [ "$OS_NAME" == "Ubuntu" ]; then
    # remove cmake
    sudo apt remove --purge --auto-remove -y cmake

    # install python3-pip
    # pass DEBIAN_FRONTEND through sudo; a variable set before sudo is usually dropped
    sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get install -y curl software-properties-common wget
    sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get install -y python3-pip zlib1g-dev libssl-dev

elif [ "$OS_NAME" == "CentOS Linux" ]; then
    if [ $VERSION_ID == "7" ]; then
        yum remove cmake3 -y
    elif [ $VERSION_ID == "8" ]; then
        yum remove cmake -y
    fi

    yum install -y wget curl bzip2 perl zlib-devel openssl-devel
fi

check_python_version

# install some utils
python3 -m pip install --upgrade pip -i https://pypi.org/simple
python3 -m pip install setuptools-rust colorama tqdm wheel -i https://pypi.org/simple

# install cmake 3.22.1
mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://github.com/Kitware/CMake/releases/download/v3.22.1/cmake-3.22.1-linux-x86_64.sh &&
    cd /var/tmp && chmod +x cmake-3.22.1-linux-x86_64.sh &&
    sudo sh cmake-3.22.1-linux-x86_64.sh --prefix=/usr --skip-license &&
    rm -rf /var/tmp/cmake-3.22.1-linux-x86_64.sh && cd -

# install hwloc 2.7.0
mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://download.open-mpi.org/release/hwloc/v2.7/hwloc-2.7.0.tar.bz2 &&
    tar -x -f /var/tmp/hwloc-2.7.0.tar.bz2 -C /var/tmp -j &&
    cd /var/tmp/hwloc-2.7.0 && ./configure &&
    make -j$(nproc) &&
    sudo make -j$(nproc) install &&
    rm -rf /var/tmp/hwloc* && cd -

# ${VAR:+...} keeps set -u from aborting when LD_LIBRARY_PATH is unset
export LD_LIBRARY_PATH=/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LIBRARY_PATH=/usr/local/lib

# install openmpi 4.1.2
mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.bz2 &&
    tar -x -f /var/tmp/openmpi-4.1.2.tar.bz2 -C /var/tmp -j &&
    cd /var/tmp/openmpi-4.1.2 && ./configure --disable-getpwuid --disable-oshmem --enable-fortran --enable-mca-no-build=btl-uct --enable-orterun-prefix-by-default --with-cuda --without-verbs &&
    make -j$(nproc) &&
    sudo make -j$(nproc) install &&
    rm -rf /var/tmp/openmpi-4.1.2 /var/tmp/openmpi-4.1.2.tar.bz2 && cd -

# install rust
if ! command -v cargo &>/dev/null; then
    curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain stable -y
    export PATH="$HOME/.cargo/bin:$PATH"
fi

# clone repository
git clone --recurse-submodules https://github.com/BaguaSys/bagua.git

# install bagua core dependencies
cd bagua/bagua_core
python3 bagua_install_deps.py

# set required flags
export BAGUA_NO_INSTALL_DEPS=1
# use $HOME: a tilde inside double quotes is not expanded
export LIBRARY_PATH="$HOME/.local/share/bagua/nccl/lib:$LIBRARY_PATH"
cd ..

# execute setup
python3 setup.py install --user

This gives me the following error:

     Running `/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-565ed2c99881c59f/build-script-build`
The following warnings were emitted during compilation:

warning: nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
warning: nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

error: failed to run custom build command for `bagua-core-internal v0.1.2 (/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal)`

Caused by:
  process didn't exit successfully: `/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-565ed2c99881c59f/build-script-build` (exit status: 101)
  --- stdout
  TARGET = Some("x86_64-unknown-linux-gnu")
  OPT_LEVEL = Some("3")
  HOST = Some("x86_64-unknown-linux-gnu")
  CXX_x86_64-unknown-linux-gnu = None
  CXX_x86_64_unknown_linux_gnu = None
  HOST_CXX = None
  CXX = None
  NVCC_x86_64-unknown-linux-gnu = None
  NVCC_x86_64_unknown_linux_gnu = None
  HOST_NVCC = None
  NVCC = None
  CXXFLAGS_x86_64-unknown-linux-gnu = None
  CXXFLAGS_x86_64_unknown_linux_gnu = None
  HOST_CXXFLAGS = None
  CXXFLAGS = None
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("false")
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "cpp/include" "-I" "third_party/cub-1.8.0" "-I" "/home/ubuntu/.local/share/bagua/nccl/include" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-std=c++14" "-cudart=shared" "-gencode" "arch=compute_35,code=sm_35" "-gencode" "arch=compute_37,code=sm_37" "-gencode" "arch=compute_50,code=sm_50" "-gencode" "arch=compute_52,code=sm_52" "-gencode" "arch=compute_53,code=sm_53" "-gencode" "arch=compute_60,code=sm_60" "-gencode" "arch=compute_61,code=sm_61" "-gencode" "arch=compute_62,code=sm_62" "-gencode" "arch=compute_70,code=sm_70" "-gencode" "arch=compute_72,code=sm_72" "-gencode" "arch=compute_75,code=sm_75" "-gencode" "arch=compute_80,code=sm_80" "-gencode" "arch=compute_86,code=sm_86" "-o" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/kernels/bagua_kernels.o" "-c" "kernels/bagua_kernels.cu"
  cargo:warning=nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  exit status: 0
  AR_x86_64-unknown-linux-gnu = None
  AR_x86_64_unknown_linux_gnu = None
  HOST_AR = None
  AR = None
  running: "ar" "cq" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/kernels/bagua_kernels.o"
  exit status: 0
  running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "cpp/include" "-I" "third_party/cub-1.8.0" "-I" "/home/ubuntu/.local/share/bagua/nccl/include" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-std=c++14" "-cudart=shared" "-gencode" "arch=compute_35,code=sm_35" "-gencode" "arch=compute_37,code=sm_37" "-gencode" "arch=compute_50,code=sm_50" "-gencode" "arch=compute_52,code=sm_52" "-gencode" "arch=compute_53,code=sm_53" "-gencode" "arch=compute_60,code=sm_60" "-gencode" "arch=compute_61,code=sm_61" "-gencode" "arch=compute_62,code=sm_62" "-gencode" "arch=compute_70,code=sm_70" "-gencode" "arch=compute_72,code=sm_72" "-gencode" "arch=compute_75,code=sm_75" "-gencode" "arch=compute_80,code=sm_80" "-gencode" "arch=compute_86,code=sm_86" "--device-link" "-o" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/bagua_kernels_dlink.o" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a"
  cargo:warning=nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  exit status: 0
  running: "ar" "cq" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/bagua_kernels_dlink.o"
  exit status: 0
  running: "ar" "s" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a"
  exit status: 0
  cargo:rustc-link-lib=static=bagua_kernels
  cargo:rustc-link-search=native=/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out
  CXXSTDLIB_x86_64-unknown-linux-gnu = None
  CXXSTDLIB_x86_64_unknown_linux_gnu = None
  HOST_CXXSTDLIB = None
  CXXSTDLIB = None
  cargo:rustc-link-lib=stdc++
  cargo:rustc-link-search=native=/usr/local/cuda/bin/../targets/x86_64-linux/lib
  cargo:rustc-link-lib=cudart_static
  CMAKE_TOOLCHAIN_FILE_x86_64-unknown-linux-gnu = None
  CMAKE_TOOLCHAIN_FILE_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_TOOLCHAIN_FILE = None
  CMAKE_TOOLCHAIN_FILE = None
  CMAKE_GENERATOR_x86_64-unknown-linux-gnu = None
  CMAKE_GENERATOR_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_GENERATOR = None
  CMAKE_GENERATOR = None
  CMAKE_PREFIX_PATH_x86_64-unknown-linux-gnu = None
  CMAKE_PREFIX_PATH_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_PREFIX_PATH = None
  CMAKE_PREFIX_PATH = None
  CMAKE_x86_64-unknown-linux-gnu = None
  CMAKE_x86_64_unknown_linux_gnu = None
  HOST_CMAKE = None
  CMAKE = None
  running: "cmake" "/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/Aluminum" "-DCMAKE_CXX_STANDARD=17" "-DALUMINUM_ENABLE_NCCL=YES" "-DCUB_INCLUDE_PATH=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/cub-1.8.0" "-DNCCL_LIBRARY=/home/ubuntu/.local/share/bagua/nccl/lib/libnccl.so" "-DNCCL_INCLUDE_PATH=/home/ubuntu/.local/share/bagua/nccl/include" "-DBUILD_SHARED_LIBS=off" "-DCMAKE_INSTALL_PREFIX=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/../../../bagua_core/.data" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_C_COMPILER=/usr/bin/cc" "-DCMAKE_CXX_FLAGS= -std=c++17 -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_COMPILER=/usr/bin/c++" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"

  --- stderr
  thread 'main' panicked at '
  failed to execute command: No such file or directory (os error 2)
  is `cmake` not installed?

  build script failed, must exit now', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/cmake-0.1.48/src/lib.rs:975:5
  stack backtrace:
     0: rust_begin_unwind
               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
     1: core::panicking::panic_fmt
               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
     2: cmake::fail
     3: cmake::run
     4: cmake::Config::build
     5: build_script_build::main
     6: core::ops::function::FnOnce::call_once
  note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: cargo failed with code: 101

CMake is actually installed (cmake --version works). This is the command that fails:

"cmake" "/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/Aluminum" "-DCMAKE_CXX_STANDARD=17" "-DALUMINUM_ENABLE_NCCL=YES" "-DCUB_INCLUDE_PATH=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/cub-1.8.0" "-DNCCL_LIBRARY=/home/ubuntu/.local/share/bagua/nccl/lib/libnccl.so" "-DNCCL_INCLUDE_PATH=/home/ubuntu/.local/share/bagua/nccl/include" "-DBUILD_SHARED_LIBS=off" "-DCMAKE_INSTALL_PREFIX=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/../../../bagua_core/.data" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_C_COMPILER=/usr/bin/cc" "-DCMAKE_CXX_FLAGS= -std=c++17 -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_COMPILER=/usr/bin/c++" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"

Please let me know if this error is reproducible. I'd appreciate any tips on how to fix this.
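
For anyone trying to reproduce this, here is a rough check that might narrow it down. It assumes the build script starts cmake by bare name (as the quoted command suggests), so the lookup would go through the PATH of the process that runs the build. The helper name find_on_path is made up for this sketch and is not part of Bagua or cargo. It lists every executable with the given name in PATH order; the first hit is the one a spawned process would run, and no output means the lookup would fail even if an alias or shell function makes cmake --version work interactively.

```sh
#!/bin/sh
# Hypothetical helper: print every executable file called "$1" found on PATH,
# in lookup order. Runs in a subshell so IFS and globbing changes stay local.
find_on_path() (
    name=$1
    IFS=:      # split PATH on ':' only
    set -f     # do not glob-expand PATH entries
    for dir in $PATH; do
        # an empty PATH entry means the current directory
        [ -n "$dir" ] || dir=.
        candidate=$dir/$name
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
)

# What the shell reports (may be an alias or function) ...
command -v cmake || echo "shell: cmake not found"
# ... versus the plain PATH lookup a child process would perform.
find_on_path cmake | head -n 1
```

Running this in the same session, and as the same user, that runs python3 setup.py install --user should show whether the two answers differ.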

Thanks, Max

woqidaideshi commented 2 years ago

Can you check where your cmake is installed? You may need to add its directory to the PATH environment variable. For example, if the cmake binary is in /usr/local/bin/, run the following command to add that directory to PATH:

export PATH=/usr/local/bin:$PATH
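
A slightly more defensive variant of the same idea, as a sketch only: the helper below (prepend_to_path is a name invented for this example) adds a directory to PATH only when it is not already listed, so repeated runs do not keep growing the variable. /usr/local/bin is just the example directory from above; substitute wherever your cmake binary actually lives.

```sh
#!/bin/sh
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                      # already present, leave PATH unchanged
        *) PATH="$1${PATH:+:$PATH}" ;;    # otherwise put it in front
    esac
    export PATH
}

# Example directory only; replace it with the real location of cmake.
prepend_to_path /usr/local/bin

# Confirm that cmake now resolves before re-running the build.
command -v cmake || echo "cmake still not found on PATH" >&2
```

The export only lasts for the current shell session. To keep it across logins, the same line can go into a startup file such as ~/.bashrc.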