BaguaSys / bagua

Bagua Speeds up PyTorch
https://tutorials-8ro.pages.dev/
MIT License

Error installing dependencies for MNIST example #553

Closed by mmathys 2 years ago

mmathys commented 2 years ago

Describe the bug

I am trying to launch the MNIST example on a single machine on AWS. The dependencies fail to install, and the failure likely has something to do with the Rust build. Am I missing dependencies?

Environment

Reproducing

           Running `/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-65d17bb237c18142/build-script-build`
      The following warnings were emitted during compilation:

      warning: nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
      warning: nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

      error: failed to run custom build command for `bagua-core-internal v0.1.2 (/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal)`

      Caused by:
        process didn't exit successfully: `/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-65d17bb237c18142/build-script-build` (exit status: 101)
        --- stdout
        TARGET = Some("x86_64-unknown-linux-gnu")
        OPT_LEVEL = Some("3")
        HOST = Some("x86_64-unknown-linux-gnu")
        CXX_x86_64-unknown-linux-gnu = None
        CXX_x86_64_unknown_linux_gnu = None
        HOST_CXX = None
        CXX = None
        NVCC_x86_64-unknown-linux-gnu = None
        NVCC_x86_64_unknown_linux_gnu = None
        HOST_NVCC = None
        NVCC = None
        CXXFLAGS_x86_64-unknown-linux-gnu = None
        CXXFLAGS_x86_64_unknown_linux_gnu = None
        HOST_CXXFLAGS = None
        CXXFLAGS = None
        CRATE_CC_NO_DEFAULTS = None
        DEBUG = Some("false")
        CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
        running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "cpp/include" "-I" "third_party/cub-1.8.0" "-I" "/home/ubuntu/.local/share/bagua/nccl/include" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-std=c++14" "-cudart=shared" "-gencode" "arch=compute_35,code=sm_35" "-gencode" "arch=compute_37,code=sm_37" "-gencode" "arch=compute_50,code=sm_50" "-gencode" "arch=compute_52,code=sm_52" "-gencode" "arch=compute_53,code=sm_53" "-gencode" "arch=compute_60,code=sm_60" "-gencode" "arch=compute_61,code=sm_61" "-gencode" "arch=compute_62,code=sm_62" "-gencode" "arch=compute_70,code=sm_70" "-gencode" "arch=compute_72,code=sm_72" "-gencode" "arch=compute_75,code=sm_75" "-gencode" "arch=compute_80,code=sm_80" "-gencode" "arch=compute_86,code=sm_86" "-o" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/kernels/bagua_kernels.o" "-c" "kernels/bagua_kernels.cu"
        cargo:warning=nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
        exit status: 0
        AR_x86_64-unknown-linux-gnu = None
        AR_x86_64_unknown_linux_gnu = None
        HOST_AR = None
        AR = None
        running: "ar" "cq" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/libbagua_kernels.a" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/kernels/bagua_kernels.o"
        exit status: 0
        running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "cpp/include" "-I" "third_party/cub-1.8.0" "-I" "/home/ubuntu/.local/share/bagua/nccl/include" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-std=c++14" "-cudart=shared" "-gencode" "arch=compute_35,code=sm_35" "-gencode" "arch=compute_37,code=sm_37" "-gencode" "arch=compute_50,code=sm_50" "-gencode" "arch=compute_52,code=sm_52" "-gencode" "arch=compute_53,code=sm_53" "-gencode" "arch=compute_60,code=sm_60" "-gencode" "arch=compute_61,code=sm_61" "-gencode" "arch=compute_62,code=sm_62" "-gencode" "arch=compute_70,code=sm_70" "-gencode" "arch=compute_72,code=sm_72" "-gencode" "arch=compute_75,code=sm_75" "-gencode" "arch=compute_80,code=sm_80" "-gencode" "arch=compute_86,code=sm_86" "--device-link" "-o" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/bagua_kernels_dlink.o" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/libbagua_kernels.a"
        cargo:warning=nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
        exit status: 0
        running: "ar" "cq" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/libbagua_kernels.a" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/bagua_kernels_dlink.o"
        exit status: 0
        running: "ar" "s" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out/libbagua_kernels.a"
        exit status: 0
        cargo:rustc-link-lib=static=bagua_kernels
        cargo:rustc-link-search=native=/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/target/release/build/bagua-core-internal-a571c95913d0ee58/out
        CXXSTDLIB_x86_64-unknown-linux-gnu = None
        CXXSTDLIB_x86_64_unknown_linux_gnu = None
        HOST_CXXSTDLIB = None
        CXXSTDLIB = None
        cargo:rustc-link-lib=stdc++
        cargo:rustc-link-search=native=/usr/local/cuda/bin/../targets/x86_64-linux/lib
        cargo:rustc-link-lib=cudart_static
        CMAKE_TOOLCHAIN_FILE_x86_64-unknown-linux-gnu = None
        CMAKE_TOOLCHAIN_FILE_x86_64_unknown_linux_gnu = None
        HOST_CMAKE_TOOLCHAIN_FILE = None
        CMAKE_TOOLCHAIN_FILE = None
        CMAKE_GENERATOR_x86_64-unknown-linux-gnu = None
        CMAKE_GENERATOR_x86_64_unknown_linux_gnu = None
        HOST_CMAKE_GENERATOR = None
        CMAKE_GENERATOR = None
        CMAKE_PREFIX_PATH_x86_64-unknown-linux-gnu = None
        CMAKE_PREFIX_PATH_x86_64_unknown_linux_gnu = None
        HOST_CMAKE_PREFIX_PATH = None
        CMAKE_PREFIX_PATH = None
        CMAKE_x86_64-unknown-linux-gnu = None
        CMAKE_x86_64_unknown_linux_gnu = None
        HOST_CMAKE = None
        CMAKE = None
        running: "cmake" "/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum" "-DCMAKE_CXX_STANDARD=17" "-DALUMINUM_ENABLE_NCCL=YES" "-DCUB_INCLUDE_PATH=/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/cub-1.8.0" "-DNCCL_LIBRARY=/home/ubuntu/.local/share/bagua/nccl/lib/libnccl.so" "-DNCCL_INCLUDE_PATH=/home/ubuntu/.local/share/bagua/nccl/include" "-DBUILD_SHARED_LIBS=off" "-DCMAKE_INSTALL_PREFIX=/tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/../../../bagua_core/.data" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_C_COMPILER=/usr/bin/cc" "-DCMAKE_CXX_FLAGS= -std=c++17 -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_COMPILER=/usr/bin/c++" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"
        -- The CXX compiler identification is GNU 9.3.0
        -- Detecting CXX compiler ABI info
        -- Detecting CXX compiler ABI info - done
        -- Check for working CXX compiler: /usr/bin/c++ - skipped
        -- Detecting CXX compile features
        -- Detecting CXX compile features - done
        -- NCCL support requested but no GPU runtime enabled. Assuming CUDA support.
        -- Performing Test CXX_COMPILER_HAS_FALIGNED_NEW
        -- Performing Test CXX_COMPILER_HAS_FALIGNED_NEW - Success
        -- Performing Test CXX_COMPILER_HAS_G3
        -- Performing Test CXX_COMPILER_HAS_G3 - Success
        -- Performing Test CXX_COMPILER_HAS_OG
        -- Performing Test CXX_COMPILER_HAS_OG - Success
        -- Found MPI_CXX: /opt/amazon/openmpi/lib/libmpi.so (found suitable version "3.1", minimum required is "3.0")
        -- Found MPI: TRUE (found suitable version "3.1", minimum required is "3.0") found components: CXX
        -- Looking for C++ include pthread.h
        -- Looking for C++ include pthread.h - found
        -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
        -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
        -- Check if compiler accepts -pthread
        -- Check if compiler accepts -pthread - yes
        -- Found Threads: TRUE
        -- Found HWLOC: /usr/lib/x86_64-linux-gnu/libhwloc.so
        -- Found CUDA: /usr/local/cuda (found suitable version "11.3", minimum required is "9.0")
        -- The CUDA compiler identification is NVIDIA 11.3.109
        -- Detecting CUDA compiler ABI info
        -- Detecting CUDA compiler ABI info - done
        -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
        -- Detecting CUDA compile features
        -- Detecting CUDA compile features - done
        -- Found NCCL: /home/ubuntu/.local/share/bagua/nccl/lib/libnccl.so (found suitable version "2.9.9", minimum required is "2.7.0")
        -- Found CUB: /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/cub-1.8.0
        -- Configuring done
        -- Generating done
        -- Build files have been written to: /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/bagua_core/.data/build
        running: "cmake" "--build" "." "--target" "install" "--config" "Release" "--parallel" "4"
        [  7%] Building CXX object src/CMakeFiles/Al.dir/Al.cpp.o
        [ 15%] Building CXX object src/CMakeFiles/Al.dir/mempool.cpp.o
        [ 23%] Building CXX object src/CMakeFiles/Al.dir/mpi_impl.cpp.o
        [ 30%] Building CXX object src/CMakeFiles/Al.dir/profiling.cpp.o
        [ 38%] Building CXX object src/CMakeFiles/Al.dir/progress.cpp.o

        --- stderr
        CMake Warning (dev) in src/CMakeLists.txt:
          Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
          empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
          for policy details.  Use the cmake_policy command to set the policy and
          suppress this warning.

          CUDA_ARCHITECTURES is empty for target "Al".
        This warning is for project developers.  Use -Wno-dev to suppress it.

        CMake Warning:
          Manually-specified variables were not used by the project:

            CMAKE_ASM_COMPILER
            CMAKE_ASM_FLAGS

        make: warning: -j4 forced in submake: resetting jobserver mode.
        In file included from /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/Al.hpp:1221,
                         from /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/src/mpi_impl.cpp:28:
        /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/aluminum/nccl_impl.hpp: In function ‘ncclRedOp_t Al::internal::nccl::ReductionOperator2ncclRedOp(Al::ReductionOperator)’:
        /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/aluminum/nccl_impl.hpp:143:12: error: ‘ncclAvg’ was not declared in this scope; did you mean ‘nccl’?
          143 |     return ncclAvg;
              |            ^~~~~~~
              |            nccl
        In file included from /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/Al.hpp:1221,
                         from /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/src/Al.cpp:35:
        /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/aluminum/nccl_impl.hpp: In function ‘ncclRedOp_t Al::internal::nccl::ReductionOperator2ncclRedOp(Al::ReductionOperator)’:
        /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/aluminum/nccl_impl.hpp:143:12: error: ‘ncclAvg’ was not declared in this scope; did you mean ‘nccl’?
          143 |     return ncclAvg;
              |            ^~~~~~~
              |            nccl
        make[2]: *** [src/CMakeFiles/Al.dir/build.make:104: src/CMakeFiles/Al.dir/mpi_impl.cpp.o] Error 1
        make[2]: *** Waiting for unfinished jobs....
        make[2]: *** [src/CMakeFiles/Al.dir/build.make:76: src/CMakeFiles/Al.dir/Al.cpp.o] Error 1
        In file included from /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/Al.hpp:1221,
                         from /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/src/progress.cpp:31:
        /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/aluminum/nccl_impl.hpp: In function ‘ncclRedOp_t Al::internal::nccl::ReductionOperator2ncclRedOp(Al::ReductionOperator)’:
        /tmp/pip-install-z14bsv9q/bagua_b1ea10d6927a48eab006199d1dfaa765/rust/bagua-core/bagua-core-internal/third_party/Aluminum/include/aluminum/nccl_impl.hpp:143:12: error: ‘ncclAvg’ was not declared in this scope; did you mean ‘nccl’?
          143 |     return ncclAvg;
              |            ^~~~~~~
              |            nccl
        make[2]: *** [src/CMakeFiles/Al.dir/build.make:132: src/CMakeFiles/Al.dir/progress.cpp.o] Error 1
        make[1]: *** [CMakeFiles/Makefile2:958: src/CMakeFiles/Al.dir/all] Error 2
        make: *** [Makefile:146: all] Error 2
        thread 'main' panicked at '
        command did not execute successfully, got: exit status: 2

        build script failed, must exit now', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/cmake-0.1.48/src/lib.rs:975:5
        note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
      error: cargo failed with code: 101

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bagua
Failed to build bagua
ERROR: Could not build wheels for bagua, which is required to install pyproject.toml-based projects
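
Reading the log above, the lines that actually fail are the three identical ‘ncclAvg’ compile errors in Aluminum's nccl_impl.hpp, and CMake reports NCCL 2.9.9. As far as I know, ncclAvg was only added in NCCL 2.10, so a too-old NCCL is a plausible cause, but that is an assumption and is not confirmed anywhere in this thread. A minimal Python sketch for pulling those two facts out of a long pip log (the helper name, the regexes, and the 2.10 threshold are mine, not part of Bagua):

```python
import re

# Hypothetical helper; the regexes target the gcc and CMake line shapes seen in the log above.
ERROR_RE = re.compile(r"error: (.+)$")
NCCL_RE = re.compile(r'Found NCCL: \S+ \(found suitable version "([\d.]+)"')

# Assumption: ncclAvg first appeared in NCCL 2.10.
NCCL_AVG_MIN = (2, 10)


def summarize_build_log(text):
    """Return (unique error messages in first-seen order, NCCL version tuple or None)."""
    errors = []
    nccl_version = None
    for line in text.splitlines():
        found = NCCL_RE.search(line)
        if found:
            nccl_version = tuple(int(part) for part in found.group(1).split("."))
        failed = ERROR_RE.search(line)
        if failed and failed.group(1) not in errors:
            errors.append(failed.group(1))
    return errors, nccl_version


# Shortened excerpt of the log above, used only to exercise the helper.
sample = '''
-- Found NCCL: /home/ubuntu/.local/share/bagua/nccl/lib/libnccl.so (found suitable version "2.9.9", minimum required is "2.7.0")
nccl_impl.hpp:143:12: error: ‘ncclAvg’ was not declared in this scope; did you mean ‘nccl’?
nccl_impl.hpp:143:12: error: ‘ncclAvg’ was not declared in this scope; did you mean ‘nccl’?
'''

errors, version = summarize_build_log(sample)
print(errors)
print(version, "older than assumed ncclAvg minimum:", version < NCCL_AVG_MIN)
```

Run against the full log, this would also pick up the top-level cargo "failed to run custom build command" line, which is a consequence of the compile errors rather than a separate problem.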

Additional context

The provided Bagua AMI is outdated, so I'm not using it.

mmathys commented 2 years ago

I got it to work: I had to remove the bagua>=0.6 dependency.
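
The thread does not say which file carried the bagua>=0.6 entry. Assuming it was a pip requirements file, dropping the entry could be sketched like this in Python (the function name and the sample contents are hypothetical, for illustration only):

```python
import re


def drop_requirement(lines, package):
    """Return the requirement lines with every entry for `package` removed."""
    # Match the package name followed by a version specifier, extras, a marker,
    # or end of line, so that e.g. "bagua-core" is not mistaken for "bagua".
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*($|[<>=!~\[;])", re.IGNORECASE
    )
    return [line for line in lines if not pattern.match(line)]


# Hypothetical requirements contents; the real file is not shown in the thread.
requirements = ["torch", "torchvision", "bagua>=0.6", "bagua-core"]
print(drop_requirement(requirements, "bagua"))
```

With the entry gone, pip no longer tries to build bagua from source for the example; whether a working bagua is already installed by other means is a separate question the thread does not answer.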