NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[BUG] Installing of sparse_operation_kit from pip failed #346

Closed silpara closed 2 years ago

silpara commented 2 years ago

Describe the bug A clear and concise description of what the bug is.

To Reproduce Steps to reproduce the behavior:

  1. Use the AMI https://aws.amazon.com/releasenotes/deep-learning-ami-gpu-tensorflow-2-9-ubuntu-20-04/ to spin up an EC2 instance of type g4dn.xxlarge
  2. pip3.9 install sparse_operation_kit

Logs

Defaulting to user installation because normal site-packages is not writeable
Collecting sparse_operation_kit
  Using cached sparse_operation_kit-1.1.2-py3-none-any.whl
Collecting merlin-sok
  Using cached merlin-sok-1.1.3.tar.gz (152 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: merlin-sok
  Building wheel for merlin-sok (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for merlin-sok (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [130 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.9
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit
      copying ./sparse_operation_kit/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
      copying ./sparse_operation_kit/kit_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/_version.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/embedding_layer_handle.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/context_scope.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/embedding_variable_v2.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/inplace_initializer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/embedding_variable_v1.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/initialize.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/graph_keys.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
      copying ./sparse_operation_kit/operations/compat_ops_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
      copying ./sparse_operation_kit/operations/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
      copying ./sparse_operation_kit/saver/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
      copying ./sparse_operation_kit/saver/Saver.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/all2all_dense_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/embedding_ops.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/tf_distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/get_embedding_op.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/utils.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/base_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
      copying ./sparse_operation_kit/tf/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
      copying ./sparse_operation_kit/tf/keras/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
      copying ./sparse_operation_kit/tf/keras/mixed_precision/loss_scale_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
      copying ./sparse_operation_kit/tf/keras/mixed_precision/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/common.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/lazy_adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      running build_ext
      -- The CXX compiler identification is GNU 9.4.0
      -- The CUDA compiler identification is NVIDIA 11.2.152
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Detecting CUDA compiler ABI info
      -- Detecting CUDA compiler ABI info - done
      -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
      -- Detecting CUDA compile features
      -- Detecting CUDA compile features - done
      -- Building Sparse Operation Kit from source.
      -- Looking for C++ include pthread.h
      -- Looking for C++ include pthread.h - found
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
      -- Looking for pthread_create in pthreads
      -- Looking for pthread_create in pthreads - not found
      -- Looking for pthread_create in pthread
      -- Looking for pthread_create in pthread - found
      -- Found Threads: TRUE
      -- Found CUDA: /usr/local/cuda (found version "11.2")
      CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
        Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARIES)
      Call Stack (most recent call first):
        /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
        cmakes/FindNCCL.cmake:36 (find_package_handle_standard_args)
        CMakeLists.txt:25 (find_package)

      -- Configuring incomplete, errors occurred!
      See also "/tmp/pip-install-w5u7if11/merlin-sok_38d3b0c9055e49fba249838a52d59899/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
      See also "/tmp/pip-install-w5u7if11/merlin-sok_38d3b0c9055e49fba249838a52d59899/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeError.log".
      Traceback (most recent call last):
        File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 261, in build_wheel
          return _build_backend().build_wheel(wheel_directory, config_settings,
        File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 230, in build_wheel
          return self._build_with_temp_dir(['bdist_wheel'], '.whl',
        File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 215, in _build_with_temp_dir
          self.run_setup()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 158, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 182, in <module>
          setup(
        File "/usr/local/lib/python3.9/dist-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 148, in setup
          return run_commands(dist)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 163, in run_commands
          dist.run_commands()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 967, in run_commands
          self.run_command(cmd)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-et8cy105/overlay/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 299, in run
          self.run_command('build')
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
          self.build_extensions()
        File "setup.py", line 102, in build_extensions
          subprocess.check_call("cmake {} {} && make -j{}".format(cmake_args,
        File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command 'cmake -DSM='70;75;80' -DUSE_NVTX=OFF -DSOK_ASYNC=ON -DSOK_UNIT_TEST=OFF -DCMAKE_BUILD_TYPE=Release /tmp/pip-install-w5u7if11/merlin-sok_38d3b0c9055e49fba249838a52d59899 && make -j$(nproc)' returned non-zero exit status 1.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for merlin-sok
Failed to build merlin-sok
ERROR: Could not build wheels for merlin-sok, which is required to install pyproject.toml-based projects

Environment (please complete the following information):

Additional context I already tried setting NCCL_INCLUDE_DIR=/usr/local/cuda-11.2/include/ and NCCL_LIBRARIES=/usr/local/cuda-11.2/lib/ as environment variables, but I get the same error.

kunlunl commented 2 years ago

Has the NCCL library been installed in /usr/local/cuda-11.2/... ? If so, could you try setting the environment variable NCCL_DIR=/usr/local/cuda-11.2/?

silpara commented 2 years ago

There is no separate directory for NCCL under /usr/local/cuda-11.2/, but the nccl and libnccl files are present in /usr/local/cuda-11.2/include/ and /usr/local/cuda-11.2/lib/ respectively. Setting NCCL_DIR=/usr/local/cuda-11.2/ did not work either.

kunlunl commented 2 years ago

@silpara Could you give the specific installation location of nccl.h and libnccl.so?

For example, in the nvcr.io/nvidia/tensorflow:22.05-tf2-py3 container:

The following cmake file describes where sok will look for nccl.h and libnccl.so: https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/sparse_operation_kit/cmakes/FindNCCL.cmake

set(NCCL_INC_PATHS
    /usr/include
    /usr/local/include
    $ENV{NCCL_DIR}/include
    )

set(NCCL_LIB_PATHS
    /lib
    /lib64
    /usr/lib
    /usr/lib64
    /usr/local/lib
    /usr/local/lib64
    $ENV{NCCL_DIR}/lib
    )

find_path(NCCL_INCLUDE_DIR NAMES nccl.h PATHS ${NCCL_INC_PATHS})
find_library(NCCL_LIBRARIES NAMES nccl PATHS ${NCCL_LIB_PATHS})

If your NCCL is not installed in a standard path, you can tell cmake where it is by setting NCCL_DIR, which is why I suggested trying NCCL_DIR=/usr/local/cuda-11.2/. But if your nccl.h and libnccl.so are stored in a nested folder under /usr/local/cuda-11.2, cmake still won't be able to find them.
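
To check a machine against the search lists quoted above, a rough Python sketch can help. It only approximates the lookup: the function name locate_nccl and its parameters are hypothetical, it checks only the directories listed in the excerpt, and it ignores any default locations that cmake's find_path and find_library may search on their own.

```python
import os

# Directory lists copied from the FindNCCL.cmake excerpt quoted above.
INC_PATHS = ("/usr/include", "/usr/local/include")
LIB_PATHS = ("/lib", "/lib64", "/usr/lib", "/usr/lib64",
             "/usr/local/lib", "/usr/local/lib64")


def locate_nccl(nccl_dir=None, inc_paths=INC_PATHS, lib_paths=LIB_PATHS):
    """Return (include_dir, library_file); an entry is None when nothing matches."""
    inc_dirs = list(inc_paths)
    lib_dirs = list(lib_paths)
    if nccl_dir:
        # Mirrors $ENV{NCCL_DIR}/include and $ENV{NCCL_DIR}/lib: one level only,
        # so files in deeper nested folders are not discovered.
        inc_dirs.append(os.path.join(nccl_dir, "include"))
        lib_dirs.append(os.path.join(nccl_dir, "lib"))

    include_dir = next(
        (d for d in inc_dirs if os.path.isfile(os.path.join(d, "nccl.h"))), None)
    library = next(
        (os.path.join(d, "libnccl.so") for d in lib_dirs
         if os.path.isfile(os.path.join(d, "libnccl.so"))), None)
    return include_dir, library


if __name__ == "__main__":
    # Hypothetical usage: pass the same value you would export as NCCL_DIR.
    print(locate_nccl(os.environ.get("NCCL_DIR")))
```

If either entry comes back as None, the corresponding file is not in any of the listed directories, which matches the "missing: NCCL_INCLUDE_DIR NCCL_LIBRARIES" message in the log.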

The easiest way to solve this is to create a soft link like:

ln -s /usr/local/cuda-11.2/.../nccl.h /usr/include/nccl.h
ln -s /usr/local/cuda-11.2/.../libnccl.so* /usr/lib/

silpara commented 2 years ago

Your suggestion seems to work and cmake is now able to locate NCCL, but a different error appears:

Defaulting to user installation because normal site-packages is not writeable
Collecting sparse_operation_kit
  Using cached sparse_operation_kit-1.1.2-py3-none-any.whl
Collecting merlin-sok
  Using cached merlin-sok-1.1.3.tar.gz (152 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: merlin-sok
  Building wheel for merlin-sok (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for merlin-sok (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [158 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.9
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit
      copying ./sparse_operation_kit/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
      copying ./sparse_operation_kit/kit_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/_version.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/embedding_layer_handle.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/context_scope.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/embedding_variable_v2.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/inplace_initializer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/embedding_variable_v1.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/initialize.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      copying ./sparse_operation_kit/core/graph_keys.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
      copying ./sparse_operation_kit/operations/compat_ops_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
      copying ./sparse_operation_kit/operations/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
      copying ./sparse_operation_kit/saver/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
      copying ./sparse_operation_kit/saver/Saver.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/all2all_dense_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/embedding_ops.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/tf_distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      copying ./sparse_operation_kit/embeddings/get_embedding_op.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/utils.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      copying ./sparse_operation_kit/optimizers/base_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
      copying ./sparse_operation_kit/tf/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
      copying ./sparse_operation_kit/tf/keras/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
      copying ./sparse_operation_kit/tf/keras/mixed_precision/loss_scale_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
      copying ./sparse_operation_kit/tf/keras/mixed_precision/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
      creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/common.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/lazy_adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      copying ./sparse_operation_kit/tf/keras/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
      running build_ext
      -- The CXX compiler identification is GNU 9.4.0
      -- The CUDA compiler identification is NVIDIA 11.2.152
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Detecting CUDA compiler ABI info
      -- Detecting CUDA compiler ABI info - done
      -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
      -- Detecting CUDA compile features
      -- Detecting CUDA compile features - done
      -- Building Sparse Operation Kit from source.
      -- Looking for C++ include pthread.h
      -- Looking for C++ include pthread.h - found
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
      -- Looking for pthread_create in pthreads
      -- Looking for pthread_create in pthreads - not found
      -- Looking for pthread_create in pthread
      -- Looking for pthread_create in pthread - found
      -- Found Threads: TRUE
      -- Found CUDA: /usr/local/cuda (found version "11.2")
      -- Found NCCL: /usr/include
      -- Found NCCL    (include: /usr/include, library: /usr/lib/libnccl.so)
      CMake Error at cmakes/FindTensorFlow.cmake:23 (string):
        string sub-command REPLACE requires at least four arguments.
      Call Stack (most recent call first):
        CMakeLists.txt:26 (find_package)

      -- TensorFlow version =
      CMake Error at cmakes/FindTensorFlow.cmake:30 (string):
        string sub-command REGEX, mode MATCH needs at least 5 arguments total to
        command.
      Call Stack (most recent call first):
        CMakeLists.txt:26 (find_package)

      CMake Error at cmakes/FindTensorFlow.cmake:31 (string):
        string sub-command REPLACE requires at least four arguments.
      Call Stack (most recent call first):
        CMakeLists.txt:26 (find_package)

      CMake Error at cmakes/FindTensorFlow.cmake:32 (string):
        string sub-command REPLACE requires at least four arguments.
      Call Stack (most recent call first):
        CMakeLists.txt:26 (find_package)

      CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
        Could NOT find TensorFlow (missing: TF_LINK_DIR)
      Call Stack (most recent call first):
        /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
        cmakes/FindTensorFlow.cmake:35 (find_package_handle_standard_args)
        CMakeLists.txt:26 (find_package)

      -- Configuring incomplete, errors occurred!
      See also "/tmp/pip-install-t34pucrf/merlin-sok_cc05886f9a5244318d2001913f09174b/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
      See also "/tmp/pip-install-t34pucrf/merlin-sok_cc05886f9a5244318d2001913f09174b/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeError.log".
      Traceback (most recent call last):
        File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 261, in build_wheel
          return _build_backend().build_wheel(wheel_directory, config_settings,
        File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 230, in build_wheel
          return self._build_with_temp_dir(['bdist_wheel'], '.whl',
        File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 215, in _build_with_temp_dir
          self.run_setup()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 158, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 182, in <module>
          setup(
        File "/usr/local/lib/python3.9/dist-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 148, in setup
          return run_commands(dist)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 163, in run_commands
          dist.run_commands()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 967, in run_commands
          self.run_command(cmd)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-8b6qrdse/overlay/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 299, in run
          self.run_command('build')
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.9/dist-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
          self.build_extensions()
        File "setup.py", line 102, in build_extensions
          subprocess.check_call("cmake {} {} && make -j{}".format(cmake_args,
        File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command 'cmake -DSM='70;75;80' -DUSE_NVTX=OFF -DSOK_ASYNC=ON -DSOK_UNIT_TEST=OFF -DCMAKE_BUILD_TYPE=Release /tmp/pip-install-t34pucrf/merlin-sok_cc05886f9a5244318d2001913f09174b && make -j$(nproc)' returned non-zero exit status 1.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for merlin-sok
Failed to build merlin-sok
ERROR: Could not build wheels for merlin-sok, which is required to install pyproject.toml-based projects

kunlunl commented 2 years ago

It looks like cmake can't find the TensorFlow library now.

SOK uses these commands to locate the TensorFlow library:

python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))"
python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_link_flags()))"
python -c "import tensorflow as tf; print(tf.__version__)"

Could you run these commands and share what they print?

In nvcr.io/nvidia/tensorflow:22.05-tf2-py3 container, the outputs are like:

-I/usr/local/lib/python3.8/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -DEIGEN_MAX_ALIGN_BYTES=64
-L/usr/local/lib/python3.8/dist-packages/tensorflow -l:libtensorflow_framework.so.2
2.8.0

I can't access an AWS instance right now, but my guess is that python in your environment actually points to Python 2, while you type python3 when you use TensorFlow. That would cause the commands above to fail. If my guess is correct, you can create an alias before running cmake:

alias python=python3

You can remove the alias again after the installation:

unalias python
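
Because a build runs these probes in a subprocess rather than in your interactive shell, it can be useful to reproduce them the same way. The sketch below is only an illustration (the function name probe_tensorflow is hypothetical): it resolves an executable name on PATH and runs the version probe with it. Shell aliases are expanded by the interactive shell only and are not visible to child processes, so this check may give a different answer than typing the command by hand.

```python
import shutil
import subprocess

# Same one-liner as the third probe command quoted above.
PROBE = "import tensorflow as tf; print(tf.__version__)"


def probe_tensorflow(executable="python"):
    """Run the probe with whatever `executable` resolves to on PATH.

    Returns (resolved_path, message); resolved_path is None if nothing is found.
    """
    resolved = shutil.which(executable)
    if resolved is None:
        return None, f"{executable!r} not found on PATH"
    result = subprocess.run([resolved, "-c", PROBE],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Keep only the last stderr line, e.g. the ModuleNotFoundError message.
        lines = result.stderr.strip().splitlines()
        return resolved, lines[-1] if lines else "unknown error"
    return resolved, result.stdout.strip()


if __name__ == "__main__":
    for name in ("python", "python3"):
        print(name, "->", probe_tensorflow(name))
```

If the two names resolve to different interpreters and only one of them can import TensorFlow, that would be consistent with the guess above.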

zongshibuzai commented 2 years ago

I ran into the same error as above. How can it be solved?

kunlunl commented 2 years ago

I also encountered the same mistake above, please ask how to solve it now?

@zongshibuzai Is it NCCL or TensorFlow that cmake cannot find?

zongshibuzai commented 2 years ago

At first, cmake could not find NCCL. After I ran

ln /usr/local/cuda11.2/.../nccl.h /usr/include/nccl.h
ln /usr/local/cuda11.2/.../libnccl.so* /usr/lib/

cmake found NCCL, but now it cannot find TensorFlow.

kunlunl commented 2 years ago

Could you try these commands to see what will print?

python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))"
python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_link_flags()))"
python -c "import tensorflow as tf; print(tf.__version__)"

zongshibuzai commented 2 years ago

The outputs are:

-I/usr/local/python3/lib/python3.7/site-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -DEIGEN_MAX_ALIGN_BYTES=64
-L/usr/local/python3/lib/python3.7/site-packages/tensorflow -l:libtensorflow_framework.so.2
2.8.0

kunlunl commented 2 years ago

Do the error logs also look like this?

...
CMake Error at cmakes/FindTensorFlow.cmake:23 (string):
        string sub-command REPLACE requires at least four arguments.
...
CMake Error at cmakes/FindTensorFlow.cmake:30 (string):
        string sub-command REGEX, mode MATCH needs at least 5 arguments total to
        command.
...

And I want to double-check: you got 2.8.0 with python -c "import tensorflow as tf; print(tf.__version__)", not with python3 ..., right?

zongshibuzai commented 2 years ago

The error logs:

ModuleNotFoundError: No module named 'tensorflow'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorflow'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorflow'
CMake Error at cmakes/FindTensorFlow.cmake:23 (string):
  string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
  CMakeLists.txt:26 (find_package)

-- TensorFlow version =
CMake Error at cmakes/FindTensorFlow.cmake:30 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.
Call Stack (most recent call first):
  CMakeLists.txt:26 (find_package)

CMake Error at cmakes/FindTensorFlow.cmake:31 (string):
  string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
  CMakeLists.txt:26 (find_package)

CMake Error at cmakes/FindTensorFlow.cmake:32 (string):
  string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
  CMakeLists.txt:26 (find_package)

CMake Error at /usr/local/share/cmake-3.8/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find TensorFlow (missing: TF_LINK_DIR)
Call Stack (most recent call first):
  /usr/local/share/cmake-3.8/Modules/FindPackageHandleStandardArgs.cmake:377 (_FPHSA_FAILURE_MESSAGE)
  cmakes/FindTensorFlow.cmake:35 (find_package_handle_standard_args)
  CMakeLists.txt:26 (find_package)

bash-4.2# python -c "import tensorflow as tf; print(tf.__version__)"
2.8.0

We got 2.8.0.

kunlunl commented 2 years ago

I can see why cmake fails: we let cmake run a subprocess such as python -c "import tensorflow as tf; print(tf.__version__)" to get the location of TensorFlow, but that python invocation fails with the error No module named 'tensorflow' (see the ModuleNotFoundError lines at the top of your log).
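
To illustrate why a failed probe ends in "Could NOT find TensorFlow (missing: TF_LINK_DIR)", here is a minimal Python sketch. It is not the actual logic in FindTensorFlow.cmake, which uses CMake string commands; the function name link_dir_from_flags is hypothetical. It only shows that an empty probe output leaves nothing from which a link directory could be extracted.

```python
import shlex


def link_dir_from_flags(link_flags):
    """Return the first -L directory in a link-flag string, or None if absent."""
    for token in shlex.split(link_flags):
        if token.startswith("-L") and len(token) > 2:
            return token[2:]
    return None


# Link flags reported earlier in this thread for a working interpreter.
good = ("-L/usr/local/lib/python3.8/dist-packages/tensorflow "
        "-l:libtensorflow_framework.so.2")
print(link_dir_from_flags(good))

# What the build receives when the probe fails and prints nothing to stdout.
print(link_dir_from_flags(""))
```

The second call returns None, which corresponds to the empty "-- TensorFlow version =" line and the missing TF_LINK_DIR in the log.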

What confuses me is that you can find TensorFlow when you run python -c "import tensorflow as tf; print(tf.__version__)" yourself, but cmake can't. Did you run that command inside an activated environment (for example, after conda activate xxx) or something similar?

zongshibuzai commented 2 years ago

Are there other version requirements for installing sparse_operation_kit, for example cmake > 3.8, or a particular CUDA version?

zongshibuzai commented 2 years ago

There was an error when I chose to compile from source:

/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/event.cc: In member function ‘void SparseOperationKit::Event::TillReady(CUstream_st*&)’:
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/event.cc:58:52: error: ‘cudaEventWaitDefault’ was not declared in this scope
   CK_CUDA(cudaStreamWaitEvent(stream, cudaevent, cudaEventWaitDefault));
   ^~~~~~~~
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/include/common.h:46:22: note: in definition of macro ‘CK_CUDA’
   cudaError_t r = (cmd); \
   ^~~
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/event.cc:58:52: note: suggested alternative: ‘cudaEventDefault’
   CK_CUDA(cudaStreamWaitEvent(stream, cudaevent, cudaEventWaitDefault));
   ^~~~~~~~
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/include/common.h:46:22: note: in definition of macro ‘CK_CUDA’
   cudaError_t r = (cmd); \
   ^~~
gmake[2]: *** [CMakeFiles/sparse_operation_kit.dir/kit_cc/kit_cc_infra/src/resources/event.cc.o] Error 1
gmake[1]: *** [CMakeFiles/sparse_operation_kit.dir/all] Error 2
gmake: *** [all] Error 2

I made the change suggested in the output and it compiles now. Could this change have any other impact?

kanghui0204 commented 2 years ago

Hi @zongshibuzai,

Q1: Are there other version requirements for installing sparse_operation_kit, for example cmake > 3.8, cuda?
The cmake version must be higher than 3.8.

Q2: error: ‘cudaEventWaitDefault’ was not declared in this scope
cudaEventWaitDefault was introduced in CUDA runtime API 11.1, so please upgrade your CUDA runtime to version 11.1 or higher.
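
As a quick way to check the CUDA requirement before building, the following sketch parses the release line that nvcc --version prints and compares it with the 11.1 minimum mentioned above. The helper names are hypothetical, and the sample string is modeled on the 11.2.152 compiler version shown in the earlier logs.

```python
import re


def parse_nvcc_release(version_text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def has_event_wait_default(version_text, minimum=(11, 1)):
    """True when the reported toolkit release is at least `minimum`."""
    release = parse_nvcc_release(version_text)
    return release is not None and release >= minimum


# Sample release line; on a real machine, pass the captured command output.
sample = "Cuda compilation tools, release 11.2, V11.2.152"
print(parse_nvcc_release(sample), has_event_wait_default(sample))
```

Tuples compare element by element, so (11, 0) is rejected while (11, 1) and (12, 0) pass.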

kanghui0204 commented 2 years ago

Hi @zongshibuzai, since this issue has been open for a long time, we are closing it now. If you have further questions about installing or running SOK, feel free to reopen this issue and comment.