apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.77k stars 6.79k forks source link

Infinity loop with cmake from 1.8.0 release and v1.8.x branch #19991

Open lgg opened 3 years ago

lgg commented 3 years ago

Description

I tried to build mxnet from source from tag 1.8.0 and from branch v1.8.x for cuda 11 support.

My steps to reproduce:

cmake loop output

cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- CMAKE_CROSSCOMPILING FALSE
-- CMAKE_HOST_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_NAME Linux
-- CMake version '3.16.3' using generator 'Unix Makefiles'
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /usr/local/cuda-11.2/bin/nvcc
CMake Warning at CMakeLists.txt:103 (message):
  CMAKE_CUDA_COMPILER guessed: /usr/local/cuda/bin/nvcc

-- The CUDA compiler identification is NVIDIA 11.2.142
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Performing Test SUPPORT_CXX11
-- Performing Test SUPPORT_CXX11 - Success
-- Performing Test SUPPORT_CXX0X
-- Performing Test SUPPORT_CXX0X - Success
-- CMAKE_BUILD_TYPE is unset, defaulting to Release
-- Intel MKL-DNN compat: set DNNL_BUILD_EXAMPLES to MKLDNN_BUILD_EXAMPLES with value `OFF`
-- Intel MKL-DNN compat: set DNNL_BUILD_TESTS to MKLDNN_BUILD_TESTS with value `OFF`
-- Intel MKL-DNN compat: set DNNL_ENABLE_JIT_PROFILING to MKLDNN_ENABLE_JIT_PROFILING with value `OFF`
-- Intel MKL-DNN compat: set DNNL_LIBRARY_TYPE to MKLDNN_LIBRARY_TYPE with value `STATIC`
-- Intel MKL-DNN compat: set DNNL_ARCH_OPT_FLAGS to MKLDNN_ARCH_OPT_FLAGS with value ``
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Found Git: /usr/bin/git (found version "2.25.1") 
-- Primitive cache is enabled
-- Using intgemm
-- Compiling with OpenMP
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE   
-- Found OpenBLAS libraries: /usr/lib/x86_64-linux-gnu/libopenblas.so
-- Found OpenBLAS include: /usr/include/x86_64-linux-gnu
-- Found OpenCV: /usr (found version "4.2.0") found components: core highgui imgproc imgcodecs 
-- OpenCV 4.2.0 found (/usr/lib/x86_64-linux-gnu/cmake/opencv4)
--  OpenCV_LIBS=opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs
USE_LAPACK is ON
CMake Warning at 3rdparty/googletest/googletest/CMakeLists.txt:47 (project):
  VERSION keyword not followed by a value or was followed by a value that
  expanded to nothing.

-- Found PythonInterp: /usr/bin/python (found version "2.7.18") 
-- Found GTest: gtest  
-- Could NOT find CUDNN (missing: CUDNN_LIBRARY CUDNN_INCLUDE) 
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Looking for fopen64
-- Looking for fopen64 - not found
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for nanosleep
-- Looking for nanosleep - found
-- Looking for backtrace
-- Looking for backtrace - found
-- backtrace facility detected in default set of libraries
-- Found Backtrace: /usr/include  
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- /home/user/mxnet1.8/mxnet1.8/3rdparty/dmlc-core/cmake/build_config.h.in -> include/dmlc/build_config.h
-- Performing Test SUPPORT_MSSE2
-- Performing Test SUPPORT_MSSE2 - Success
-- Autodetected CUDA architecture(s):  6.1 6.1
-- CUDA: Using the following NVCC architecture flags -gencode;arch=compute_61,code=sm_61
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.2.142") 
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIRS NCCL_LIBRARIES) 
CMake Warning at CMakeLists.txt:645 (message):
  Could not find NCCL libraries

-- Performing Test SUPPORT_MSSE3
-- Performing Test SUPPORT_MSSE3 - Success
-- Determining F16C support
-- Performing Test COMPILER_SUPPORT_MF16C
-- Performing Test COMPILER_SUPPORT_MF16C - Success
-- Using 64-bit integer for tensor size
-- CUDA: Adding NVCC options: --fatbin-options --compress-all
-- Configuring done
You have changed variables that require your cache to be deleted.
Configure will be re-run and you may have to reset some variables.
The following variables have changed:
CMAKE_CUDA_COMPILER= /usr/local/cuda-11.2/bin/nvcc
CMAKE_CUDA_COMPILER= /usr/local/cuda-11.2/bin/nvcc

-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- CMAKE_CROSSCOMPILING FALSE
-- CMAKE_HOST_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_NAME Linux
-- CMake version '3.16.3' using generator 'Unix Makefiles'
CMake Warning at CMakeLists.txt:103 (message):
  CMAKE_CUDA_COMPILER guessed: /usr/local/cuda/bin/nvcc

-- The CUDA compiler identification is NVIDIA 11.2.142
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Performing Test SUPPORT_CXX11
-- Performing Test SUPPORT_CXX11 - Success
-- Performing Test SUPPORT_CXX0X
-- Performing Test SUPPORT_CXX0X - Success
-- CMAKE_BUILD_TYPE is unset, defaulting to Release
-- Intel MKL-DNN compat: set DNNL_BUILD_EXAMPLES to MKLDNN_BUILD_EXAMPLES with value `OFF`
-- Intel MKL-DNN compat: set DNNL_BUILD_TESTS to MKLDNN_BUILD_TESTS with value `OFF`
-- Intel MKL-DNN compat: set DNNL_ENABLE_JIT_PROFILING to MKLDNN_ENABLE_JIT_PROFILING with value `OFF`
-- Intel MKL-DNN compat: set DNNL_LIBRARY_TYPE to MKLDNN_LIBRARY_TYPE with value `STATIC`
-- Intel MKL-DNN compat: set DNNL_ARCH_OPT_FLAGS to MKLDNN_ARCH_OPT_FLAGS with value ``
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Found Git: /usr/bin/git (found version "2.25.1") 
-- Primitive cache is enabled
-- Using intgemm
-- Compiling with OpenMP
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE   
-- Found OpenBLAS libraries: /usr/lib/x86_64-linux-gnu/libopenblas.so
-- Found OpenBLAS include: /usr/include/x86_64-linux-gnu
-- Found OpenCV: /usr (found version "4.2.0") found components: core highgui imgproc imgcodecs 
-- OpenCV 4.2.0 found (/usr/lib/x86_64-linux-gnu/cmake/opencv4)
--  OpenCV_LIBS=opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs
USE_LAPACK is ON
CMake Warning at 3rdparty/googletest/googletest/CMakeLists.txt:47 (project):
  VERSION keyword not followed by a value or was followed by a value that
  expanded to nothing.

-- Found PythonInterp: /usr/bin/python (found version "2.7.18") 
-- Found GTest: gtest  
-- Could NOT find CUDNN (missing: CUDNN_LIBRARY CUDNN_INCLUDE) 
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Looking for fopen64
-- Looking for fopen64 - not found
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for nanosleep
-- Looking for nanosleep - found
-- Looking for backtrace
-- Looking for backtrace - found
-- backtrace facility detected in default set of libraries
-- Found Backtrace: /usr/include  
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
^C

nvcc paths:

In output above all paths for nvcc are valid

...
CMake Warning at CMakeLists.txt:103 (message):
  CMAKE_CUDA_COMPILER guessed: /usr/local/cuda/bin/nvcc
...
-- The CUDA compiler identification is NVIDIA 11.2.142
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
...
You have changed variables that require your cache to be deleted.
Configure will be re-run and you may have to reset some variables.
The following variables have changed:
CMAKE_CUDA_COMPILER= /usr/local/cuda-11.2/bin/nvcc
CMAKE_CUDA_COMPILER= /usr/local/cuda-11.2/bin/nvcc

I checked this paths:

user@ml-dev:~/mxnet1.8/mxnet1.8/build$ ll  /usr/local/cuda-11.2/bin/nvcc
-rwxr-xr-x 1 root root 5473704 фев 28 17:25 /usr/local/cuda-11.2/bin/nvcc*
user@ml-dev:~/mxnet1.8/mxnet1.8/build$ ll /usr/local/cuda/bin/nvcc
-rwxr-xr-x 1 root root 5473704 фев 28 17:25 /usr/local/cuda/bin/nvcc*

What have you tried to solve it?

  1. Checked path for nvcc as said in this issue: https://github.com/apache/incubator-mxnet/issues/17761

Environment

We recommend using our script for collecting the diagnostic information with the following command curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3

Environment Information

----------Python Info----------
Version      : 3.8.5
Compiler     : GCC 9.3.0
Build        : ('default', 'Jan 27 2021 15:41:15')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /usr/lib/python3/dist-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /home/user/.local/lib/python3.8/site-packages/mxnet-2.0.0-py3.8.egg/mxnet
Commit hash file "/home/user/.local/lib/python3.8/site-packages/mxnet-2.0.0-py3.8.egg/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/user/.local/lib/python3.8/site-packages/mxnet-2.0.0-py3.8.egg/mxnet/libmxnet.so']
Build features:
✔ CUDA
✖ CUDNN
✖ NCCL
✖ TENSORRT
✖ CUTENSOR
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ DIST_KVSTORE
✔ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-5.8.0-44-generic-x86_64-with-glibc2.29
system       : Linux
node         : ml-dev
release      : 5.8.0-44-generic
version      : #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Stepping:                        10
CPU MHz:                         4327.154
CPU max MHz:                     4700,0000
CPU min MHz:                     800,0000
BogoMIPS:                        7399.70
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1,5 MiB
L3 cache:                        12 MiB
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 
                                 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology no
                                 nstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdc
                                 m pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowpref
                                 etch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsb
                                 ase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xs
                                 avec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0024 sec, LOAD: 0.4952 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.2280 sec, LOAD: 0.5484 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)>, DNS finished in 1.8177919387817383 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0738 sec, LOAD: 0.8640 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0356 sec, LOAD: 0.7704 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.11126899719238281 sec.
----------Environment----------

More env info from me:

More Environment Information

user@ml-dev:~/mxnet1.8/mxnet1.8/build$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Jan_28_19:32:09_PST_2021
Cuda compilation tools, release 11.2, V11.2.142
Build cuda_11.2.r11.2/compiler.29558016_0
user@ml-dev:~/mxnet1.8/mxnet1.8/build$ uname -a
Linux ml-dev 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
user@ml-dev:~/mxnet1.8/mxnet1.8/build$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:    20.04
Codename:   focal

Config.cmake

My config.cmake

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

#-------------------------------------------------------------------------------
#  Template configuration for compiling MXNet
#
#  If you want to change the configuration, please use the following steps.
#  Assume you are on the root directory of mxnet. First copy this file so that
#  any local changes will be ignored by git
#
#  $ cp config/linux_gpu.cmake config.cmake
#
#  Next modify the entries in the config.cmake like MXNET_CUDA_ARCH to set the specific
#  GPU architecture, and then compile by
#
#  $ mkdir build; cd build
#  $ cmake ..
#  $ cmake --build .
#
# Specify `cmake --build . --parallel N` to set the number of parallel compilation jobs.
# Default is derived from CPUs available.
#
#-------------------------------------------------------------------------------

#---------------------------------------------
# GPU support
#---------------------------------------------
set(USE_CUDA ON CACHE BOOL "Build with CUDA support")
set(USE_CUDNN ON CACHE BOOL "Build with cudnn support, if found")
set(USE_CUTENSOR ON CACHE BOOL "Build with cutensor support, if found")

# Target NVIDIA GPU achitecture.
# Valid options are:
#   - "Auto" for autodetection, will try and discover which GPU architecture to use by
#            looking at the available GPUs on the machine that you're building on
#   - "All" for all available GPU architectures supported by the version of CUDA installed
#   - "specific GPU architectures" by giving the compute capability number such as
#            "7.0" or "7.0;7.5" (ie. sm_70 or sm_75) or you can specify the name like:
#            "Volta" or "Volta;Turing", be sure not to use quotes (ie. just set to 7.0)
# The value specified here is passed to cmake's CUDA_SELECT_NVCC_ARCH_FLAGS to
# obtain the compilation flags for nvcc.
#
# When compiling on a machine without GPU, autodetection will fail and you
# should instead specify the target architecture manually.
set(MXNET_CUDA_ARCH "Auto" CACHE STRING "Target NVIDIA GPU achitecture")

#---------------------------------------------
# Common libraries
#---------------------------------------------
set(USE_BLAS "open" CACHE STRING "BLAS Vendor")

set(USE_OPENCV ON CACHE BOOL "Build with OpenCV support")
set(OPENCV_ROOT "" CACHE BOOL "OpenCV install path. Supports autodetection.")

set(USE_OPENMP ON CACHE BOOL "Build with Openmp support")

set(USE_MKL_IF_AVAILABLE ON CACHE BOOL "Use Intel MKL if found")
set(USE_MKLDNN ON CACHE BOOL "Build with MKL-DNN support")

set(USE_LAPACK ON CACHE BOOL "Build with lapack support")

set(USE_TVM_OP OFF CACHE BOOL "Enable use of TVM operator build system.")

#---------------------
# Compilers
#--------------------
# Compilers are usually autodetected. Uncomment and modify the next 3 lines to
# choose manually:

# set(CMAKE_C_COMPILER "" CACHE BOOL "C compiler")
# set(CMAKE_CXX_COMPILER "" CACHE BOOL "C++ compiler")
# set(CMAKE_CUDA_COMPILER "" CACHE BOOL "Cuda compiler (nvcc)")

#---------------------------------------------
# CPU instruction sets: The support is autodetected if turned ON
#---------------------------------------------
set(USE_SSE ON CACHE BOOL "Build with x86 SSE instruction support")
set(USE_F16C ON CACHE BOOL "Build with x86 F16C instruction support")

#----------------------------
# distributed computing
#----------------------------
set(USE_DIST_KVSTORE OFF CACHE BOOL "Build with DIST_KVSTORE support")

#----------------------------
# performance settings
#----------------------------
set(USE_OPERATOR_TUNING ON CACHE BOOL  "Enable auto-tuning of operators")
set(USE_GPERFTOOLS OFF CACHE BOOL "Build with GPerfTools support")
set(USE_JEMALLOC OFF CACHE BOOL "Build with Jemalloc support")

#----------------------------
# additional operators
#----------------------------
# path to folders containing projects specific operators that you don't want to
# put in src/operators
SET(EXTRA_OPERATORS "" CACHE PATH "EXTRA OPERATORS PATH")

#----------------------------
# other features
#----------------------------
# Create C++ interface package
set(USE_CPP_PACKAGE OFF CACHE BOOL "Build C++ Package")

# Use int64_t type to represent the total number of elements in a tensor
# This will cause performance degradation reported in issue #14496
# Set to 1 for large tensor with tensor size greater than INT32_MAX i.e. 2147483647
# Note: the size of each dimension is still bounded by INT32_MAX
set(USE_INT64_TENSOR_SIZE ON CACHE BOOL "Use int64_t to represent the total number of elements in a tensor")

# Other GPU features
set(USE_NCCL "Use NVidia NCCL with CUDA" OFF)
set(NCCL_ROOT "" CACHE BOOL "NCCL install path. Supports autodetection.")
set(USE_NVML OFF CACHE BOOL "Build with NVML support")
set(USE_NVTX ON CACHE BOOL "Build with NVTX support")

##################################### CUDA 11

set(CMAKE_BUILD_TYPE "Distribution" CACHE STRING "Build type")
set(CFLAGS "-mno-avx" CACHE STRING "CFLAGS")
set(CXXFLAGS "-mno-avx" CACHE STRING "CXXFLAGS")

set(USE_CUDA ON CACHE BOOL "Build with CUDA support")
set(USE_CUDNN ON CACHE BOOL "Build with CUDA support")
set(USE_OPENCV ON CACHE BOOL "Build with OpenCV support")
set(USE_OPENMP ON CACHE BOOL "Build with Openmp support")
set(USE_MKL_IF_AVAILABLE OFF CACHE BOOL "Use Intel MKL if found")
set(USE_MKLDNN ON CACHE BOOL "Build with MKL-DNN support")
set(USE_LAPACK ON CACHE BOOL "Build with lapack support")
set(USE_TVM_OP OFF CACHE BOOL "Enable use of TVM operator build system.")
set(USE_SSE ON CACHE BOOL "Build with x86 SSE instruction support")
set(USE_F16C OFF CACHE BOOL "Build with x86 F16C instruction support")

set(CUDACXX "/usr/local/cuda-11.2/bin/nvcc" CACHE STRING "Cuda compiler")
set(MXNET_CUDA_ARCH "5.0;6.0;7.0;8.0;8.6" CACHE STRING "Cuda architectures")
github-actions[bot] commented 3 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

leezu commented 3 years ago

Can you please try if the behavior is reproducible on the latest v0.x and master branches?

lgg commented 3 years ago

@leezu i successfully compiled from master branch, but it have mxnet==2.0.0

I will try to run it with v0.x branch

lgg commented 3 years ago

@leezu hmm, i can't found v0.x ;(

user@ml-dev:~/mxnet1.8/mxnet1.8$ git checkout v0.x
error: pathspec 'v0.x' did not match any file(s) known to git
user@ml-dev:~/mxnet1.8/mxnet1.8$ git checkout 0.x
error: pathspec '0.x' did not match any file(s) known to git
user@ml-dev:~/mxnet1.8/mxnet1.8$ git checkout 0.x
error: pathspec '0.x' did not match any file(s) known to git
user@ml-dev:~/mxnet1.8/mxnet1.8$ git checkout v0.x
error: pathspec 'v0.x' did not match any file(s) known to git
lgg commented 3 years ago

I also tried: v1.x branch - the same issue with infinity loop.

From master branch it goes fine:

$ cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- CMAKE_CROSSCOMPILING FALSE
-- CMAKE_HOST_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_NAME Linux
-- CMake version '3.16.3' using generator 'Unix Makefiles'
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /usr/local/cuda-11.2/bin/nvcc
-- The CUDA compiler identification is NVIDIA 11.2.142
-- Check for working CUDA compiler: /usr/local/cuda-11.2/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-11.2/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- CMAKE_BUILD_TYPE is unset, defaulting to Release
-- Intel MKL-DNN compat: set DNNL_BUILD_EXAMPLES to MKLDNN_BUILD_EXAMPLES with value `OFF`
-- Intel MKL-DNN compat: set DNNL_BUILD_TESTS to MKLDNN_BUILD_TESTS with value `OFF`
-- Intel MKL-DNN compat: set DNNL_ENABLE_JIT_PROFILING to MKLDNN_ENABLE_JIT_PROFILING with value `OFF`
-- Intel MKL-DNN compat: set DNNL_LIBRARY_TYPE to MKLDNN_LIBRARY_TYPE with value `STATIC`
-- Intel MKL-DNN compat: set DNNL_ARCH_OPT_FLAGS to MKLDNN_ARCH_OPT_FLAGS with value ``
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Found Git: /usr/bin/git (found version "2.25.1") 
-- Primitive cache is enabled
-- Using intgemm
-- Compiling with OpenMP
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE   
-- Found OpenBLAS libraries: /usr/lib/x86_64-linux-gnu/libopenblas.a
-- Found OpenBLAS include: /usr/include/x86_64-linux-gnu
Openblas uses GFortran, automatically linking to it
-- The Fortran compiler identification is GNU 9.3.0
-- Check for working Fortran compiler: /usr/bin/f95
-- Check for working Fortran compiler: /usr/bin/f95  -- works
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Checking whether /usr/bin/f95 supports Fortran 90
-- Checking whether /usr/bin/f95 supports Fortran 90 -- yes
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/mxnet1.8/mxnet1.8/build/temp
FORTRAN_DIR is /usr/lib/gcc/x86_64-linux-gnu/9;/usr/lib/x86_64-linux-gnu;/usr/lib;/lib/x86_64-linux-gnu;/lib
FORTRAN_LIB is /usr/lib/gcc/x86_64-linux-gnu/9/libgfortran.so
-- Looking for OPENBLAS_USE64BITINT
-- Looking for OPENBLAS_USE64BITINT - not found
Using LP64 OpenBLAS
After choosing blas, linking to dnnl;/usr/lib/x86_64-linux-gnu/libopenblas.a;/usr/lib/gcc/x86_64-linux-gnu/9/libgfortran.so
-- Found OpenCV: /usr (found version "4.2.0") found components: core highgui imgproc imgcodecs 
-- OpenCV 4.2.0 found (/usr/lib/x86_64-linux-gnu/cmake/opencv4)
--  OpenCV_LIBS=opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs
USE_LAPACK is ON
CMake Warning at 3rdparty/googletest/googletest/CMakeLists.txt:47 (project):
  VERSION keyword not followed by a value or was followed by a value that
  expanded to nothing.

-- Found PythonInterp: /usr/bin/python (found version "2.7.18") 
-- Found GTest: gtest  
-- Could NOT find CUDNN (missing: CUDNN_LIBRARY CUDNN_INCLUDE) 
-- Could NOT find CUTENSOR (missing: CUTENSOR_LIBRARY CUTENSOR_INCLUDE) 
CMake Warning (dev) at 3rdparty/dmlc-core/cmake/Utils.cmake:196 (option):
  Policy CMP0077 is not set: option() honors normal variables.  Run "cmake
  --help-policy CMP0077" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

  For compatibility with older versions of CMake, option is clearing the
  normal variable 'DMLC_FORCE_SHARED_CRT'.
Call Stack (most recent call first):
  3rdparty/dmlc-core/CMakeLists.txt:23 (dmlccore_option)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Looking for fopen64
-- Looking for fopen64 - not found
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for nanosleep
-- Looking for nanosleep - found
-- Looking for backtrace
-- Looking for backtrace - found
-- backtrace facility detected in default set of libraries
-- Found Backtrace: /usr/include  
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- /home/user/mxnet1.8/mxnet1.8/3rdparty/dmlc-core/cmake/build_config.h.in -> include/dmlc/build_config.h
-- Performing Test SUPPORT_MSSE2
-- Performing Test SUPPORT_MSSE2 - Success
-- Autodetected CUDA architecture(s):  6.1 6.1
-- CUDA: Using the following NVCC architecture flags -gencode;arch=compute_61,code=sm_61
-- Found CUDAToolkit: /usr/local/cuda-11.2/include (found version "11.2.142") 
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIRS NCCL_LIBRARIES) 
CMake Warning at CMakeLists.txt:614 (message):
  Could not find NCCL libraries

-- Performing Test SUPPORT_MSSE3
-- Performing Test SUPPORT_MSSE3 - Success
-- Determining F16C support
-- Performing Test COMPILER_SUPPORT_MF16C
-- Performing Test COMPILER_SUPPORT_MF16C - Success
-- Using 64-bit integer for tensor size
-- Found Python3: /usr/bin/python3.8 (found version "3.8.5") found components: Interpreter 
-- CUDA: Adding NVCC options: --fatbin-options --compress-all
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/mxnet1.8/mxnet1.8/build
lgg commented 3 years ago

Note: on every try with building from different git branches/tags - i cleared build folder with rm -rf build

lgg commented 3 years ago

I guess I found solution:

Manually set this in config.cmake

set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc" CACHE BOOL "Cuda compiler (nvcc)")

helped and provide this output:

$ cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- CMAKE_CROSSCOMPILING FALSE
-- CMAKE_HOST_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_PROCESSOR x86_64
-- CMAKE_SYSTEM_NAME Linux
-- CMake version '3.16.3' using generator 'Unix Makefiles'
CMake Warning at CMakeLists.txt:109 (message):
  CMAKE_CUDA_COMPILER guessed: /usr/local/cuda/bin/nvcc

-- The CUDA compiler identification is NVIDIA 11.2.142
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Performing Test SUPPORT_CXX11
-- Performing Test SUPPORT_CXX11 - Success
-- Performing Test SUPPORT_CXX0X
-- Performing Test SUPPORT_CXX0X - Success
-- CMAKE_BUILD_TYPE is unset, defaulting to Release
-- Intel MKL-DNN compat: set DNNL_BUILD_EXAMPLES to MKLDNN_BUILD_EXAMPLES with value `OFF`
-- Intel MKL-DNN compat: set DNNL_BUILD_TESTS to MKLDNN_BUILD_TESTS with value `OFF`
-- Intel MKL-DNN compat: set DNNL_ENABLE_JIT_PROFILING to MKLDNN_ENABLE_JIT_PROFILING with value `OFF`
-- Intel MKL-DNN compat: set DNNL_LIBRARY_TYPE to MKLDNN_LIBRARY_TYPE with value `STATIC`
-- Intel MKL-DNN compat: set DNNL_ARCH_OPT_FLAGS to MKLDNN_ARCH_OPT_FLAGS with value ``
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Found Git: /usr/bin/git (found version "2.25.1") 
-- Primitive cache is enabled
-- Using intgemm
-- Compiling with OpenMP
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE   
-- Found OpenBLAS libraries: /usr/lib/x86_64-linux-gnu/libopenblas.so
-- Found OpenBLAS include: /usr/include/x86_64-linux-gnu
-- Found OpenCV: /usr (found version "4.2.0") found components: core highgui imgproc imgcodecs 
-- OpenCV 4.2.0 found (/usr/lib/x86_64-linux-gnu/cmake/opencv4)
--  OpenCV_LIBS=opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs
USE_LAPACK is ON
CMake Warning at 3rdparty/googletest/googletest/CMakeLists.txt:47 (project):
  VERSION keyword not followed by a value or was followed by a value that
  expanded to nothing.

-- Found PythonInterp: /usr/bin/python (found version "2.7.18") 
-- Found GTest: gtest  
-- Could NOT find CUDNN (missing: CUDNN_LIBRARY CUDNN_INCLUDE) 
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Looking for fopen64
-- Looking for fopen64 - not found
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for nanosleep
-- Looking for nanosleep - found
-- Looking for backtrace
-- Looking for backtrace - found
-- backtrace facility detected in default set of libraries
-- Found Backtrace: /usr/include  
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- /home/user/mxnet1.8/mxnet1.8/3rdparty/dmlc-core/cmake/build_config.h.in -> include/dmlc/build_config.h
-- Performing Test SUPPORT_MSSE2
-- Performing Test SUPPORT_MSSE2 - Success
-- Autodetected CUDA architecture(s):  6.1 6.1
-- CUDA: Using the following NVCC architecture flags -gencode;arch=compute_61,code=sm_61
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.2.142") 
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIRS NCCL_LIBRARIES) 
CMake Warning at CMakeLists.txt:668 (message):
  Could not find NCCL libraries

-- Performing Test SUPPORT_MSSE3
-- Performing Test SUPPORT_MSSE3 - Success
-- Determining F16C support
-- Performing Test COMPILER_SUPPORT_MF16C
-- Performing Test COMPILER_SUPPORT_MF16C - Success
-- CUDA: Adding NVCC options: --fatbin-options --compress-all
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/mxnet1.8/mxnet1.8/build

but if i change this path to:

set(CMAKE_CUDA_COMPILER "/usr/local/cuda-11.2/bin/nvcc" CACHE BOOL "Cuda compiler (nvcc)")

it still goes to infinity loop

cloudlakecho commented 1 year ago

@lgg Thanks for sharing tip. You solution looks also working with CUDA 11.4 in Arm processor.

I am just curious how you find the solution from the symptom?