Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0
291 stars 71 forks

When I install BlueFog from pip (GPU), an error occurs #107

Closed lkzs closed 2 years ago

lkzs commented 2 years ago

During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/tmp/pip-install-teoqshks/bluefog_c270e8e510a24fe89dcf2ef326e7cd47/setup.py", line 585, in build_extensions
      build_torch_extension(self, options, torch_version)
    File "/tmp/pip-install-teoqshks/bluefog_c270e8e510a24fe89dcf2ef326e7cd47/setup.py", line 503, in build_torch_extension
      options['COMPILE_FLAGS'])
    File "/tmp/pip-install-teoqshks/bluefog_c270e8e510a24fe89dcf2ef326e7cd47/setup.py", line 266, in get_nccl_dirs
      'NCCL 2.4 library or its later version was not found (see error above).\n'
  distutils.errors.DistutilsPlatformError: NCCL 2.4 library or its later version was not found (see error above).
  Please specify correct NCCL location with the BLUEFOG_NCCL_HOME environment variable or combination of BLUEFOG_NCCL_INCLUDE and BLUEFOG_NCCL_LIB environment variables.

  BLUEFOG_NCCL_HOME - path where NCCL include and lib directories can be found
  BLUEFOG_NCCL_INCLUDE - path to NCCL include directory
  BLUEFOG_NCCL_LIB - path to NCCL lib directory

  error: Neither PyTorch or TensorFlow plugins were built. See errors above.
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> bluefog

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

hanbinhu commented 2 years ago

Hi,

Thanks for trying out our package. This appears to be related to your NCCL installation. Please let us know which NCCL version is installed and where it is installed. You can refer to the official NCCL guide for installation instructions. The suggested NCCL version is above 2.7.

In addition, please let us know which steps you followed to install BlueFog. We recommend the steps on this page.
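For reporting the installed NCCL version, one option is to ask the shared library itself. The sketch below is only an illustration, not part of BlueFog: it calls NCCL's `ncclGetVersion` through `ctypes` and decodes the integer version code. The helper names `decode_nccl_version` and `query_nccl_version` are hypothetical, and the library lookup in the example is an assumption about where `libnccl` lives on your machine.

```python
import ctypes
import ctypes.util


def decode_nccl_version(code):
    """Split an NCCL version code into (major, minor, patch)."""
    if code >= 10000:
        # Scheme used from NCCL 2.9 on: major * 10000 + minor * 100 + patch.
        return code // 10000, (code % 10000) // 100, code % 100
    # Older scheme (up to NCCL 2.8): major * 1000 + minor * 100 + patch.
    return code // 1000, (code % 1000) // 100, code % 100


def query_nccl_version(library_path):
    """Load the NCCL shared library and ask it for its version.

    Returns None if the library cannot be loaded or the call fails.
    """
    try:
        lib = ctypes.CDLL(library_path)
        code = ctypes.c_int(0)
        status = lib.ncclGetVersion(ctypes.byref(code))
    except (OSError, AttributeError):
        return None
    if status != 0:  # 0 is ncclSuccess
        return None
    return decode_nccl_version(code.value)


if __name__ == "__main__":
    # Assumption: find_library only searches the default system locations,
    # so a conda or home-directory install may need an explicit path instead.
    path = ctypes.util.find_library("nccl")
    print(query_nccl_version(path) if path else "libnccl not found")
```

For an NCCL copy installed outside the system paths, pass the full path to the `libnccl.so` file to `query_nccl_version` instead of relying on `find_library`.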

lkzs commented 2 years ago

The NCCL version is 2.12. Because I don't have root access, I installed NCCL with conda. I also tried building NCCL from the GitHub source code, but the command 'make -j12 src.build BUILDDIR=/home/chenz/software/nccl CUDA_HOME=/usr/local/cuda-10.2 NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35"' failed with an error. Thanks for the reply.

BichengYing commented 2 years ago

In this case, you have to add BLUEFOG_NCCL_HOME=<the path you installed> before the pip install command to tell BlueFog where to find the NCCL library. Please try it. If it still doesn't work, let us know.
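Before running pip, it can help to confirm that the directory you plan to use actually contains the NCCL header and library. The following is a minimal sketch under stated assumptions: it assumes the usual layout of `include/nccl.h` and `lib/libnccl.so*` under the NCCL root (some installs use `lib64` instead), and the helpers `check_nccl_home` and `bluefog_build_env` are hypothetical names, not BlueFog functions. The `BLUEFOG_NCCL_HOME` variable comes from the error message above; `BLUEFOG_WITH_NCCL=1` is the flag used later in this thread.

```python
import os
from pathlib import Path


def check_nccl_home(nccl_home):
    """Return (has_header, has_library) for a candidate NCCL root directory."""
    root = Path(nccl_home)
    has_header = (root / "include" / "nccl.h").is_file()
    lib_dir = root / "lib"  # assumption: libraries live in "lib", not "lib64"
    has_library = lib_dir.is_dir() and any(lib_dir.glob("libnccl.so*"))
    return has_header, has_library


def bluefog_build_env(nccl_home, base_env=None):
    """Copy an environment and add the variables the BlueFog build reads."""
    env = dict(os.environ if base_env is None else base_env)
    env["BLUEFOG_NCCL_HOME"] = str(nccl_home)
    env["BLUEFOG_WITH_NCCL"] = "1"
    return env


if __name__ == "__main__":
    # Assumption: with a conda install, the active environment prefix is a
    # plausible candidate; "/path/to/nccl" is only a placeholder.
    candidate = os.environ.get("CONDA_PREFIX", "/path/to/nccl")
    header_ok, library_ok = check_nccl_home(candidate)
    print(f"{candidate}: nccl.h found={header_ok}, libnccl found={library_ok}")
```

If both checks pass, the dictionary from `bluefog_build_env` can be passed as the `env` argument of `subprocess.run` when invoking `pip install --no-cache-dir bluefog`, which is equivalent to prefixing the variables on the command line.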

lkzs commented 2 years ago

I installed NCCL with the command 'conda install nccl', but I can't find an nccl_include or nccl_lib directory. Maybe this approach is wrong. I don't have root access, so how can I install NCCL and BlueFog?

lkzs commented 2 years ago

This problem has been solved. I successfully installed bluefog-0.3.0 with the command 'BLUEFOG_NCCL_HOME=/home/zhenfeng/software/nccl BLUEFOG_WITH_NCCL=1 pip install --no-cache-dir bluefog'.

But when I run the [bluefog-tutorial] notebook "Applying BlueFog on Deep Learning problem(High Level API Introduction).ipynb" with the command 'ibfrun start -np 4', an error occurs when running the 'Start decentralized training' cell. The error is as follows:

Failed, NCCL error bluefog/common/nccl_controller.cc:750 'invalid usage'
Failed, NCCL error bluefog/common/nccl_controller.cc:750 'invalid usage'

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

2022-04-13 00:30:04.355 [KernelNanny.0] Parent 45415 exited with status None.
2022-04-13 00:30:04.356 [KernelNanny.0] Notifying Hub that our parent has shut down

mpirun noticed that process rank 3 with PID 0 on node Server8 exited on signal 11 (Segmentation fault).

Try to kill ipcontroller process but cannot retrieve its pid. Maybe it is already been stopped.
removed ipengine_config file

My environment is Ubuntu 18.04, NCCL 2.12, Open MPI 4.0.7, and BlueFog 0.3.0. Why does this error happen? Is it because the NCCL version is inappropriate? Can anyone help me? Thanks very much!

BichengYing commented 2 years ago

Same comment as in #108. We discussed this offline: it is more likely a CUDA and NCCL installation issue than a BlueFog one. Feel free to re-open if necessary.