Closed apeforest closed 4 years ago
We may refactor https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindNCCL.cmake to improve autodetection. In the meantime, see the variables it uses for searching. If you set one of them to your nccl base directory, it should find nccl successfully.
I experienced this too. Try using -DUSE_NCCL=1 -DUSE_NCCL_PATH=/usr/local/cuda/include (or, as @leezu suggests, your own NCCL path).
@mjsML Thanks, using that flag worked for me. @guanxinq or @ChaiBapchya interested in fixing FindNCCL.cmake as suggested? :)
I took a look at this auto-detection issue.
To solve this particular case, I added a check for symlink (if UNIX) - https://github.com/ChaiBapchya/incubator-mxnet/blob/nccl_autodetect/cmake/Modules/FindNCCL.cmake
If this is enough, I can submit a PR.
However, I'm not sure it is complete, because I took a look at https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindCUDAToolkit.cmake and it has a fairly long-winded way of finding the CUDA Toolkit.
Is this what's needed? @leezu @apeforest In that case it makes sense to factor out this check, as it will be used in two places (FindCUDAToolkit and FindNCCL).
@apeforest could you provide some background on whether NCCL is installed at /usr/local/cuda/include by default?
@ChaiBapchya your change seems to rely on CUDA_TOOLKIT_ROOT_DIR, but this variable is not among the variables exported by FindCUDAToolkit. In fact, you can see it's explicitly unset: https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindCUDAToolkit.cmake#L708
Instead, let's use the result variables, specifically CUDAToolkit_INCLUDE_DIRS and CUDAToolkit_LIBRARY_DIR. Or would the nccl library not be at the CUDAToolkit_LIBRARY_DIR?
Besides using the CUDAToolkit variables as additional defaults to find nccl, the NCCL_ROOT variable needs to be examined as per https://cmake.org/cmake/help/latest/policy/CMP0074.html (which is done correctly currently, I think).
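The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the actual FindNCCL.cmake: the output variable names NCCL_INCLUDE_DIRS and NCCL_LIBRARIES are assumptions, and it presumes find_package(CUDAToolkit) has already run so that its result variables are populated.

```cmake
# Sketch only. Assumes find_package(CUDAToolkit) ran earlier, so that
# CUDAToolkit_INCLUDE_DIRS and CUDAToolkit_LIBRARY_DIR are set.
# When this file is used as a find module via find_package(NCCL) and policy
# CMP0074 is NEW, the find_* calls below also search the prefix given by the
# NCCL_ROOT CMake variable or environment variable.

# Header: use the CUDA toolkit include directory as an extra hint.
find_path(NCCL_INCLUDE_DIRS
  NAMES nccl.h
  HINTS ${CUDAToolkit_INCLUDE_DIRS})

# Library: use the CUDA toolkit library directory as an extra hint.
find_library(NCCL_LIBRARIES
  NAMES nccl
  HINTS ${CUDAToolkit_LIBRARY_DIR})

# Sets NCCL_FOUND and prints a standard found/not-found message.
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(NCCL
  REQUIRED_VARS NCCL_INCLUDE_DIRS NCCL_LIBRARIES)
```

If nccl is not at either CUDA toolkit location, both lookups fall through to CMake's default system search paths, so a package-manager install may still be picked up.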
In DLAMI, nccl is installed by default in the cuda directory: /usr/local/cuda/include/nccl.h
However, if the user installed nccl manually with sudo apt install libnccl2 libnccl-dev, you may use sudo dpkg-query -L libnccl-dev to find where it is.
https://askubuntu.com/questions/1134732/where-is-nccl-h
I would suggest that @ChaiBapchya first search /usr/local/cuda/include/. If not found, try sudo dpkg-query -L libnccl-dev instead. Would that work?
Thanks @ChaiBapchya for volunteering to work on this!
"If not found, try sudo dpkg-query -L libnccl-dev instead."
That would only work on Debian-based platforms, and only for one particular way of installing nccl on these systems. I think it's safe to require users to set NCCL_ROOT if they manually installed nccl to a different path.
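Requiring NCCL_ROOT for non-default installs is easier on users if the failure message says so. A hedged sketch of how the calling CMakeLists.txt might do that follows; the message wording is illustrative, and it assumes a FindNCCL.cmake is on CMAKE_MODULE_PATH and sets NCCL_FOUND.

```cmake
# Sketch only. USE_NCCL is the build option mentioned earlier in this thread;
# the error text is an assumption for illustration.
if(USE_NCCL)
  find_package(NCCL)
  if(NOT NCCL_FOUND)
    # Stop configuring and tell the user how to point CMake at their install.
    message(FATAL_ERROR
      "Could not find NCCL. If NCCL is installed in a non-default location, "
      "set NCCL_ROOT to its install prefix, for example: "
      "cmake -DNCCL_ROOT=/path/to/nccl ...")
  endif()
endif()
```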
To improve the user experience, we may fall back to building nccl ourselves if nccl is required and not found. PyTorch does that, for example.
Yeah. When I looked at the CMake autodetection files used in various other open-source frameworks, they take a similar approach: look in a default path, an env var (NCCL_ROOT), or /usr/local/cuda.
Agree with @leezu. I haven't seen "dpkg-query" or equivalent "find" commands used in cmake; they are more for command-line searches. In cmake, there are find_path and find_library, which do a similar job.
Thanks @apeforest @leezu for chiming in!
@ChaiBapchya BTW, unfortunately a lot of CMake usage out in the wild does not meet the modern CMake bar but is leftover from the early days of CMake. While not covering all use-cases of MXNet, sometimes we can refer to https://cliutils.gitlab.io/modern-cmake/ for best practices
Description
If I build mxnet with NCCL using cmake, it fails with "Could not find NCCL libraries" even though my NCCL is installed at /usr/local/cuda/include.