Closed silpara closed 2 years ago
Has the NCCL library been installed in /usr/local/cuda-11.2/... ?
If yes, could you try setting the environment variable NCCL_DIR=/usr/local/cuda-11.2/ ?
There is no separate directory for NCCL under /usr/local/cuda-11.2/, but the nccl and libnccl files are present in /usr/local/cuda-11.2/include/ and /usr/local/cuda-11.2/lib/ respectively. Setting NCCL_DIR=/usr/local/cuda-11.2/ did not work either.
@silpara Could you give the exact installation locations of nccl.h and libnccl.so?
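One way to answer this is to walk the CUDA tree and list every match. A minimal Python sketch, assuming the files live somewhere under /usr/local/cuda-11.2; the helper name find_nccl_files is made up for illustration and is not part of SOK:

```python
import fnmatch
import os


def find_nccl_files(root):
    """Return sorted paths under root that are named nccl.h or libnccl.so*."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == "nccl.h" or fnmatch.fnmatch(name, "libnccl.so*"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Assumed root directory; change it to wherever CUDA is installed.
    for path in find_nccl_files("/usr/local/cuda-11.2"):
        print(path)
```

If the root directory does not exist, the function simply returns an empty list.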
For example, in the nvcr.io/nvidia/tensorflow:22.05-tf2-py3 container:
nccl.h is at /usr/include/nccl.h
libnccl.so is at /usr/lib/x86_64-linux-gnu/libnccl.so
The following cmake file describes where SOK looks for nccl.h and libnccl.so: https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/sparse_operation_kit/cmakes/FindNCCL.cmake
set(NCCL_INC_PATHS
/usr/include
/usr/local/include
$ENV{NCCL_DIR}/include
)
set(NCCL_LIB_PATHS
/lib
/lib64
/usr/lib
/usr/lib64
/usr/local/lib
/usr/local/lib64
$ENV{NCCL_DIR}/lib
)
find_path(NCCL_INCLUDE_DIR NAMES nccl.h PATHS ${NCCL_INC_PATHS})
find_library(NCCL_LIBRARIES NAMES nccl PATHS ${NCCL_LIB_PATHS})
If NCCL is not installed in one of these standard paths, you can tell cmake where to look by setting NCCL_DIR; this is why I suggested trying NCCL_DIR=/usr/local/cuda-11.2/.
But if nccl.h and libnccl.so are stored in a nested folder under /usr/local/cuda-11.2 (that is, not directly in its include and lib subdirectories), cmake still can't find them.
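To illustrate why a nested folder is missed, here is a rough Python model of the header lookup. It is only a sketch: the function names are hypothetical, it covers just the directories listed in NCCL_INC_PATHS, and it ignores any extra default locations that cmake itself may search.

```python
import os


def nccl_include_dirs(nccl_dir=None):
    """Include directories to check, modelled on NCCL_INC_PATHS."""
    dirs = ["/usr/include", "/usr/local/include"]
    if nccl_dir:
        dirs.append(os.path.join(nccl_dir, "include"))
    return dirs


def first_dir_containing(dirs, filename):
    """Return the first directory that directly contains filename, or None.

    Subdirectories are deliberately not searched.
    """
    for directory in dirs:
        if os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None
```

With NCCL_DIR=/usr/local/cuda-11.2/, the last candidate is /usr/local/cuda-11.2/include, so a header sitting one level deeper than that directory would not be picked up.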
The easiest way to solve this is to create symbolic links, for example:
ln -s /usr/local/cuda-11.2/.../nccl.h /usr/include/nccl.h
ln -s /usr/local/cuda-11.2/.../libnccl.so* /usr/lib/
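If you would rather script the linking, the same idea can be sketched in Python. The helper name link_into is hypothetical, and writing into /usr/include or /usr/lib normally requires root permissions:

```python
import os


def link_into(src_paths, dest_dir):
    """Create a symlink in dest_dir for each source file.

    Names that already exist in dest_dir are left untouched.
    Returns the list of links that were created.
    """
    created = []
    for src in src_paths:
        dest = os.path.join(dest_dir, os.path.basename(src))
        if os.path.lexists(dest):
            continue  # do not overwrite an existing file or link
        os.symlink(os.path.abspath(src), dest)
        created.append(dest)
    return created
```

The source paths would be the real locations of nccl.h and the libnccl.so* files on your machine, with /usr/include and /usr/lib as the destination directories.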
Your suggestion seems to work: cmake can now locate NCCL, but another error appears:
Defaulting to user installation because normal site-packages is not writeable
Collecting sparse_operation_kit
Using cached sparse_operation_kit-1.1.2-py3-none-any.whl
Collecting merlin-sok
Using cached merlin-sok-1.1.3.tar.gz (152 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: merlin-sok
Building wheel for merlin-sok (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for merlin-sok (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [158 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/sparse_operation_kit
copying ./sparse_operation_kit/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
copying ./sparse_operation_kit/kit_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/_version.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/embedding_layer_handle.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/context_scope.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/embedding_variable_v2.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/inplace_initializer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/embedding_variable_v1.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/initialize.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
copying ./sparse_operation_kit/core/graph_keys.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/core
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
copying ./sparse_operation_kit/operations/compat_ops_lib.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
copying ./sparse_operation_kit/operations/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/operations
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
copying ./sparse_operation_kit/saver/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
copying ./sparse_operation_kit/saver/Saver.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/saver
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
copying ./sparse_operation_kit/embeddings/all2all_dense_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
copying ./sparse_operation_kit/embeddings/embedding_ops.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
copying ./sparse_operation_kit/embeddings/tf_distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
copying ./sparse_operation_kit/embeddings/distributed_embedding.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
copying ./sparse_operation_kit/embeddings/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
copying ./sparse_operation_kit/embeddings/get_embedding_op.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/embeddings
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
copying ./sparse_operation_kit/optimizers/optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
copying ./sparse_operation_kit/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
copying ./sparse_operation_kit/optimizers/utils.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
copying ./sparse_operation_kit/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
copying ./sparse_operation_kit/optimizers/base_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/optimizers
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
copying ./sparse_operation_kit/tf/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
copying ./sparse_operation_kit/tf/keras/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
copying ./sparse_operation_kit/tf/keras/mixed_precision/loss_scale_optimizer.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
copying ./sparse_operation_kit/tf/keras/mixed_precision/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/mixed_precision
creating build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
copying ./sparse_operation_kit/tf/keras/optimizers/common.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
copying ./sparse_operation_kit/tf/keras/optimizers/__init__.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
copying ./sparse_operation_kit/tf/keras/optimizers/lazy_adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
copying ./sparse_operation_kit/tf/keras/optimizers/adam.py -> build/lib.linux-x86_64-3.9/sparse_operation_kit/tf/keras/optimizers
running build_ext
-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is NVIDIA 11.2.152
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Building Sparse Operation Kit from source.
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found version "11.2")
-- Found NCCL: /usr/include
-- Found NCCL (include: /usr/include, library: /usr/lib/libnccl.so)
CMake Error at cmakes/FindTensorFlow.cmake:23 (string):
string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
CMakeLists.txt:26 (find_package)
-- TensorFlow version =
CMake Error at cmakes/FindTensorFlow.cmake:30 (string):
string sub-command REGEX, mode MATCH needs at least 5 arguments total to
command.
Call Stack (most recent call first):
CMakeLists.txt:26 (find_package)
CMake Error at cmakes/FindTensorFlow.cmake:31 (string):
string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
CMakeLists.txt:26 (find_package)
CMake Error at cmakes/FindTensorFlow.cmake:32 (string):
string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
CMakeLists.txt:26 (find_package)
CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find TensorFlow (missing: TF_LINK_DIR)
Call Stack (most recent call first):
/usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
cmakes/FindTensorFlow.cmake:35 (find_package_handle_standard_args)
CMakeLists.txt:26 (find_package)
-- Configuring incomplete, errors occurred!
See also "/tmp/pip-install-t34pucrf/merlin-sok_cc05886f9a5244318d2001913f09174b/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
See also "/tmp/pip-install-t34pucrf/merlin-sok_cc05886f9a5244318d2001913f09174b/build/lib.linux-x86_64-3.9/CMakeFiles/CMakeError.log".
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
main()
File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/usr/local/lib/python3.9/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 261, in build_wheel
return _build_backend().build_wheel(wheel_directory, config_settings,
File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 230, in build_wheel
return self._build_with_temp_dir(['bdist_wheel'], '.whl',
File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 215, in _build_with_temp_dir
self.run_setup()
File "/usr/local/lib/python3.9/dist-packages/setuptools/build_meta.py", line 158, in run_setup
exec(compile(code, __file__, 'exec'), locals())
File "setup.py", line 182, in <module>
setup(
File "/usr/local/lib/python3.9/dist-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 148, in setup
return run_commands(dist)
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/core.py", line 163, in run_commands
dist.run_commands()
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 967, in run_commands
self.run_command(cmd)
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/tmp/pip-build-env-8b6qrdse/overlay/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 299, in run
self.run_command('build')
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.9/dist-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/usr/local/lib/python3.9/dist-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "setup.py", line 102, in build_extensions
subprocess.check_call("cmake {} {} && make -j{}".format(cmake_args,
File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cmake -DSM='70;75;80' -DUSE_NVTX=OFF -DSOK_ASYNC=ON -DSOK_UNIT_TEST=OFF -DCMAKE_BUILD_TYPE=Release /tmp/pip-install-t34pucrf/merlin-sok_cc05886f9a5244318d2001913f09174b && make -j$(nproc)' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for merlin-sok
Failed to build merlin-sok
ERROR: Could not build wheels for merlin-sok, which is required to install pyproject.toml-based projects
It looks like cmake can't find the TensorFlow library now.
SOK uses these commands to locate it:
python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))"
python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_link_flags()))"
python -c "import tensorflow as tf; print(tf.__version__)"
Could you run these commands and share what they print?
In the nvcr.io/nvidia/tensorflow:22.05-tf2-py3 container, the outputs look like:
-I/usr/local/lib/python3.8/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -DEIGEN_MAX_ALIGN_BYTES=64
-L/usr/local/lib/python3.8/dist-packages/tensorflow -l:libtensorflow_framework.so.2
2.8.0
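The "missing: TF_LINK_DIR" message in the log is what you would expect if the link-flags command printed nothing. As a rough illustration (this is not the logic in FindTensorFlow.cmake, and the function name is made up), a link directory can be pulled out of that string like this:

```python
def split_link_flags(flags):
    """Split a link-flag string into library directories (-L) and libraries (-l)."""
    link_dirs, libs = [], []
    for token in flags.split():
        if token.startswith("-L"):
            link_dirs.append(token[2:])
        elif token.startswith("-l"):
            libs.append(token[2:])
    return link_dirs, libs


if __name__ == "__main__":
    # Sample output copied from the container example above.
    sample = ("-L/usr/local/lib/python3.8/dist-packages/tensorflow "
              "-l:libtensorflow_framework.so.2")
    print(split_link_flags(sample))
    # An empty string, as when the python command fails, yields no directories.
    print(split_link_flags(""))
```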
I can't access an AWS instance right now, but my guess is that python in your container actually points to python2, while you type python3 when you use TensorFlow. That would cause the above commands to fail.
If my guess is correct, you can create an alias before running cmake:
alias python=python3
You can remove the alias after the installation:
unalias python
I ran into the same error as above. How can it be solved?
@zongshibuzai Is it NCCL or TensorFlow that cmake cannot find?
At first, cmake could not find NCCL. After running
ln /usr/local/cuda11.2/.../nccl.h /usr/include/nccl.h
ln /usr/local/cuda11.2/.../libnccl.so* /usr/lib/
cmake finds NCCL, but now it cannot find TensorFlow.
Could you try these commands to see what will print?
python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))"
python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_link_flags()))"
python -c "import tensorflow as tf; print(tf.__version__)"
The outputs are:
-I/usr/local/python3/lib/python3.7/site-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -DEIGEN_MAX_ALIGN_BYTES=64
-L/usr/local/python3/lib/python3.7/site-packages/tensorflow -l:libtensorflow_framework.so.2
2.8.0
Do the error logs also look like this?
...
CMake Error at cmakes/FindTensorFlow.cmake:23 (string):
string sub-command REPLACE requires at least four arguments.
...
CMake Error at cmakes/FindTensorFlow.cmake:30 (string):
string sub-command REGEX, mode MATCH needs at least 5 arguments total to
command.
...
And to double check: you got 2.8.0 with python -c "import tensorflow as tf; print(tf.__version__)", not with python3 ..., right?
the error logs :
ModuleNotFoundError: No module named 'tensorflow'
Traceback (most recent call last):
File "
-- TensorFlow version =
CMake Error at cmakes/FindTensorFlow.cmake:30 (string):
string sub-command REGEX, mode MATCH needs at least 5 arguments total to
command.
Call Stack (most recent call first):
CMakeLists.txt:26 (find_package)
CMake Error at cmakes/FindTensorFlow.cmake:31 (string):
string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
CMakeLists.txt:26 (find_package)
CMake Error at cmakes/FindTensorFlow.cmake:32 (string):
string sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
CMakeLists.txt:26 (find_package)
CMake Error at /usr/local/share/cmake-3.8/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
Could NOT find TensorFlow (missing: TF_LINK_DIR)
Call Stack (most recent call first):
/usr/local/share/cmake-3.8/Modules/FindPackageHandleStandardArgs.cmake:377 (_FPHSA_FAILURE_MESSAGE)
cmakes/FindTensorFlow.cmake:35 (find_package_handle_standard_args)
CMakeLists.txt:26 (find_package)
bash-4.2# python -c "import tensorflow as tf; print(tf.__version__)"
2.8.0
We got 2.8.0.
I can see why cmake fails: cmake runs a subprocess (see cmakes/FindTensorFlow.cmake) such as python -c "import tensorflow as tf; print(tf.__version__)" to get the location of TensorFlow, but that python exits with the error No module named 'tensorflow' (lines 2 to 8 of your log).
What confuses me is that you can find TensorFlow by running python -c "import tensorflow as tf; print(tf.__version__)" yourself, but cmake can't. Did you run that command inside an environment such as conda activate xxx, or something similar?
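A quick way to compare interpreters is to run the import in a child process, much as cmake does. This is only a sketch under that assumption; the helper name can_import is hypothetical:

```python
import subprocess
import sys


def can_import(interpreter, module):
    """Return True if `interpreter -c "import module"` exits with status 0."""
    result = subprocess.run(
        [interpreter, "-c", "import {}".format(module)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling can_import(sys.executable, "tensorflow") checks the interpreter you are currently running. Passing the path returned by shutil.which("python") instead checks the python that a subprocess would find on PATH. If the two answers differ, the shell you tested in and the build are probably using different Python installations.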
Are there other version requirements for installing sparse_operation_kit, for example cmake > 3.8, or a particular CUDA version?
There was an error when I chose to compile from source code:
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/event.cc: In member function ‘void SparseOperationKit::Event::TillReady(CUstream_st*&)’:
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/event.cc:58:52: error: ‘cudaEventWaitDefault’ was not declared in this scope
CK_CUDA(cudaStreamWaitEvent(stream, cudaevent, cudaEventWaitDefault));
^~~~~~~~
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/include/common.h:46:22: note: in definition of macro ‘CK_CUDA’
cudaError_t r = (cmd); \
^~~
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/src/resources/event.cc:58:52: note: suggested alternative: ‘cudaEventDefault’
CK_CUDA(cudaStreamWaitEvent(stream, cudaevent, cudaEventWaitDefault));
^~~~~~~~
/hugectr/sparse_operation_kit/kit_cc/kit_cc_infra/include/common.h:46:22: note: in definition of macro ‘CK_CUDA’
cudaError_t r = (cmd); \
^~~
gmake[2]: *** [CMakeFiles/sparse_operation_kit.dir/kit_cc/kit_cc_infra/src/resources/event.cc.o] Error 1
gmake[1]: *** [CMakeFiles/sparse_operation_kit.dir/all] Error 2
gmake: *** [all] Error 2
I made the change as prompted and it now works. Does it have any other impact?
Hi @zongshibuzai,
Q1: Are there other version requirements for installing sparse_operation_kit, for example cmake > 3.8, CUDA?
The cmake version must be higher than 3.8.
Q2: error: ‘cudaEventWaitDefault’ was not declared in this scope
cudaEventWaitDefault was introduced in CUDA runtime API 11.1, so please upgrade your CUDA runtime to 11.1 or later.
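To check the CUDA requirement before building, the release number can be read from the output of nvcc --version. A small sketch, assuming the usual "release X.Y" wording in that output and reading the requirement as "11.1 or later"; both function names are made up for illustration:

```python
import re


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from text containing 'release X.Y', or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_event_wait_default(version):
    """True if the version tuple is at least (11, 1); assumption: 11.1 or later suffices."""
    return version is not None and version >= (11, 1)


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 11.2, V11.2.152"
    version = parse_cuda_release(sample)
    print(version, supports_event_wait_default(version))
```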
Hi @zongshibuzai, this issue has been open for a long time, so we will close it now. If you have further questions about installing or running SOK, you can reopen this issue and comment.
Additional context: I already tried setting NCCL_INCLUDE_DIR=/usr/local/cuda-11.2/include/ and NCCL_LIBRARIES=/usr/local/cuda-11.2/lib/ as environment variables, but I get the same error.