bytedance / byteps

A high performance and generic framework for distributed DNN training

Unable to install Pytorch plugin when running python setup.py install #383

Closed anj-s closed 3 years ago

anj-s commented 3 years ago

Describe the bug
I get the following error when attempting to run python setup.py install:

INFO: Above error indicates that this PyTorch installation does not support CUDA.
building 'byteps.torch.c_lib' extension
creating build/temp.linux-x86_64-3.8/byteps/torch
gcc -pthread -B /private/home/anj/.conda/envs/byteps_env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=0 -DTORCH_VERSION=1007001000 -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_API_INCLUDE_EXTENSION_H=1 -I3rdparty/ps-lite/include -I/public/apps/NCCL/2.7.8-1/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/TH -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/THC -I/private/home/anj/.conda/envs/byteps_env/include/python3.8 -c byteps/common/common.cc -o build/temp.linux-x86_64-3.8/byteps/common/common.o -std=c++14 -fPIC -Ofast -Wall -fopenmp -march=native -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from byteps/common/common.cc:20:
byteps/common/common.h:21:10: fatal error: cuda_runtime.h: No such file or directory
   21 | #include <cuda_runtime.h>
      |          ^~~~
compilation terminated.
INFO: Unable to build PyTorch plugin, will skip it.

This works if I use a symlink at /usr/local/cuda instead. For some reason, setting any other path does not work. I also did not see build_torch_extension call get_cuda_dirs in setup.py. How does it know where CUDA is installed?

To Reproduce
Steps to reproduce the behavior:
export BYTEPS_NCCL_HOME=/.../NCCL/2.7.8-1
export BYTEPS_CUDA_HOME=/.../cuda/11.0
git clone --recurse-submodules https://github.com/bytedance/byteps
cd byteps/
python setup.py install

Environment (please complete the following information):
OS: Ubuntu
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
CUDA and NCCL version: CUDA 11.0, NCCL 2.7.8
Framework (TF, PyTorch, MXNet): PyTorch 1.8

bobzhuyb commented 3 years ago

Briefly checked the code, @pleasantrabbit I think we should add get_cuda_dirs to build_torch_extension.

pleasantrabbit commented 3 years ago

Indeed. Will update it.
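
For readers following along: the gcc command in the log above carries an -I flag for the NCCL include directory but none for CUDA, which is consistent with cuda_runtime.h not being found. The sketch below is not the code from setup.py or from the PR. The names cuda_build_dirs and has_cuda_runtime_header are hypothetical, and the /usr/local/cuda fallback and the lib/lib64 layout are assumptions. It only illustrates how a build script could derive the directories from BYTEPS_CUDA_HOME and check for the missing header before compiling.

```python
import os


def cuda_build_dirs(env=None, default_home="/usr/local/cuda"):
    """Hypothetical helper: derive CUDA include/library dirs from BYTEPS_CUDA_HOME.

    Falls back to default_home (an assumed default) when the variable is unset.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("BYTEPS_CUDA_HOME") or default_home
    include_dirs = [os.path.join(cuda_home, "include")]
    # Assumed layout: toolkits commonly ship libraries under lib and/or lib64.
    library_dirs = [os.path.join(cuda_home, "lib"), os.path.join(cuda_home, "lib64")]
    return include_dirs, library_dirs


def has_cuda_runtime_header(include_dirs):
    """Return True if cuda_runtime.h exists in any of the given include dirs."""
    return any(
        os.path.isfile(os.path.join(d, "cuda_runtime.h")) for d in include_dirs
    )


if __name__ == "__main__":
    include_dirs, library_dirs = cuda_build_dirs()
    print("include dirs:", include_dirs)
    print("library dirs:", library_dirs)
    print("cuda_runtime.h found:", has_cuda_runtime_header(include_dirs))
```

In a setuptools-based build, directories like these would typically be appended to the extension's include_dirs and library_dirs so that the compiler receives the matching -I and -L flags.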

anj-s commented 3 years ago

I updated setup.py as shown in the PR, but I am now running into the following error:

2021-04-11 08:39:27.543284: D byteps/common/global.cc:320] Shutdown BytePS: start to clean the resources (rank=1)
Traceback (most recent call last):
  File "byteps/example/pytorch/train_mnist_byteps.py", line 108, in <module>
    bps.broadcast_parameters(model.state_dict(), root_rank=0)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/__init__.py", line 287, in broadcast_parameters
    handle = byteps_push_pull(p, average=False, name=prefix+name)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 174, in push_pull_async_inplace
    return _do_push_pull_async(tensor, tensor, average, name, version, priority)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 71, in _do_push_pull_async
    function = _check_function(_push_pull_function_factory, tensor)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 57, in _check_function
    raise ValueError('Tensor type %s is not supported.' % tensor.type())
ValueError: Tensor type torch.cuda.FloatTensor is not supported.

This goes away if I use /usr/local/cuda. Is there something else I am missing?

anj-s commented 3 years ago

Finally figured this out: in addition to the PR changes above, you need to add the path that you set in BYTEPS_CUDA_HOME to your $PATH environment variable.
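
The fix above can be sketched in Python for anyone scripting the build. This is a minimal illustration, not part of BytePS: path_with_cuda is a hypothetical helper. The comment does not say whether the CUDA root or its bin subdirectory is meant, so the sketch prepends both, which is an assumption. A plausible explanation, not confirmed in this thread, is that the PyTorch extension tooling locates CUDA through nvcc on PATH when no explicit CUDA home is set, and otherwise falls back to /usr/local/cuda.

```python
import os
import shutil


def path_with_cuda(cuda_home, current_path):
    """Hypothetical helper: return a PATH string with the CUDA dirs prepended.

    Adds both <cuda_home>/bin and <cuda_home> (an assumption, since the thread
    does not say which one is needed) and skips entries already present.
    """
    parts = [p for p in current_path.split(os.pathsep) if p]
    candidates = [os.path.join(cuda_home, "bin"), cuda_home]
    new_parts = [c for c in candidates if c not in parts]
    return os.pathsep.join(new_parts + parts)


if __name__ == "__main__":
    cuda_home = os.environ.get("BYTEPS_CUDA_HOME")
    if cuda_home:
        # Affects this process and any child processes it starts afterwards.
        os.environ["PATH"] = path_with_cuda(cuda_home, os.environ.get("PATH", ""))
    # shutil.which returns None when nvcc cannot be found on PATH.
    print("nvcc resolved to:", shutil.which("nvcc"))
```

Changing os.environ only affects the current process and its children, so the build would have to be launched from the same script. In an interactive shell, the equivalent is to export the updated PATH before running python setup.py install.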