Closed anj-s closed 3 years ago
Briefly checked the code. @pleasantrabbit, I think we should add get_cuda_dirs to build_torch_extension.
Indeed. Will update it.
I updated setup.py as shown in the PR, but I am running into the following error:
2021-04-11 08:39:27.543284: D byteps/common/global.cc:320] Shutdown BytePS: start to clean the resources (rank=1)
Traceback (most recent call last):
  File "byteps/example/pytorch/train_mnist_byteps.py", line 108, in <module>
    bps.broadcast_parameters(model.state_dict(), root_rank=0)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/__init__.py", line 287, in broadcast_parameters
    handle = byteps_push_pull(p, average=False, name=prefix+name)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 174, in push_pull_async_inplace
    return _do_push_pull_async(tensor, tensor, average, name, version, priority)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 71, in _do_push_pull_async
    function = _check_function(_push_pull_function_factory, tensor)
  File "/private/home/anj/.conda/envs/byteps_env_clone/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/torch/ops.py", line 57, in _check_function
    raise ValueError('Tensor type %s is not supported.' % tensor.type())
ValueError: Tensor type torch.cuda.FloatTensor is not supported.
This goes away if I use /usr/local/cuda. Is there something else I am missing?
Finally figured this out: You need to add the path that you set in BYTEPS_CUDA_HOME to your $PATH env var in addition to the PR changes above.
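A quick way to sanity-check this setup before rebuilding is sketched below. It is only an illustration: check_cuda_env is a made-up helper, not part of BytePS, and whether the CUDA root itself or its bin subdirectory has to be on PATH is an assumption, so the check accepts either.

```python
import os


def check_cuda_env(environ=None):
    """Report whether BYTEPS_CUDA_HOME looks usable.

    Hypothetical helper for illustration; not part of BytePS.
    """
    environ = os.environ if environ is None else environ
    cuda_home = environ.get("BYTEPS_CUDA_HOME")
    if not cuda_home:
        return {"cuda_home": None, "on_path": False, "has_header": False}

    path_entries = [
        os.path.normpath(p)
        for p in environ.get("PATH", "").split(os.pathsep)
        if p
    ]
    # Assumption: either the CUDA root or its bin/ directory on PATH counts.
    candidates = {
        os.path.normpath(cuda_home),
        os.path.normpath(os.path.join(cuda_home, "bin")),
    }
    on_path = any(p in candidates for p in path_entries)

    # cuda_runtime.h is the header the failed build could not find.
    header = os.path.join(cuda_home, "include", "cuda_runtime.h")
    has_header = os.path.isfile(header)

    return {"cuda_home": cuda_home, "on_path": on_path, "has_header": has_header}


if __name__ == "__main__":
    print(check_cuda_env())
```

If on_path or has_header comes back False, the build or the runtime CUDA check may fall back to a CPU-only configuration, which would match the "Tensor type torch.cuda.FloatTensor is not supported" error above.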
Describe the bug
I get the following error when attempting to run python setup.py install.

INFO: Above error indicates that this PyTorch installation does not support CUDA.
building 'byteps.torch.c_lib' extension
creating build/temp.linux-x86_64-3.8/byteps/torch
gcc -pthread -B /private/home/anj/.conda/envs/byteps_env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=0 -DTORCH_VERSION=1007001000 -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_API_INCLUDE_EXTENSION_H=1 -I3rdparty/ps-lite/include -I/public/apps/NCCL/2.7.8-1/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/TH -I/private/home/anj/.conda/envs/byteps_env/lib/python3.8/site-packages/torch/include/THC -I/private/home/anj/.conda/envs/byteps_env/include/python3.8 -c byteps/common/common.cc -o build/temp.linux-x86_64-3.8/byteps/common/common.o -std=c++14 -fPIC -Ofast -Wall -fopenmp -march=native -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from byteps/common/common.cc:20:
byteps/common/common.h:21:10: fatal error: cuda_runtime.h: No such file or directory
   21 | #include <cuda_runtime.h>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
INFO: Unable to build PyTorch plugin, will skip it.

This works if I use a symlink to point to /usr/local/cuda instead. For some reason setting another path does not work. I also did not see build_torch_extension calling get_cuda_dirs in setup.py. How does it know which path CUDA is set to?

To Reproduce
Steps to reproduce the behavior:
export BYTEPS_NCCL_HOME=/.../NCCL/2.7.8-1
export BYTEPS_CUDA_HOME=/.../cuda/11.0
git clone --recurse-submodules https://github.com/bytedance/byteps
cd byteps/
python setup.py install
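On the question of how the build finds CUDA: the gcc command in the log carries -I flags for ps-lite, NCCL, PyTorch and Python, but none for a CUDA include directory, which is consistent with cuda_runtime.h not being found. A minimal sketch of the kind of lookup an extension build would need is below. The function names are hypothetical stand-ins, not the actual BytePS get_cuda_dirs, and both the /usr/local/cuda fallback and the lib64/lib layout are assumptions.

```python
import os


def cuda_dirs_from_env(environ=None, default_home="/usr/local/cuda"):
    """Return (include_dirs, library_dirs) for a CUDA install.

    Hypothetical stand-in for illustration; not the BytePS get_cuda_dirs.
    """
    environ = os.environ if environ is None else environ
    # Assumption: fall back to /usr/local/cuda when the variable is unset.
    cuda_home = environ.get("BYTEPS_CUDA_HOME") or default_home
    include_dirs = [os.path.join(cuda_home, "include")]
    # Assumption: libraries live under lib64 and/or lib.
    library_dirs = [os.path.join(cuda_home, d) for d in ("lib64", "lib")]
    return include_dirs, library_dirs


def compiler_include_flags(include_dirs):
    """Turn include directories into -I flags as they would appear on the gcc line."""
    return ["-I" + d for d in include_dirs]


if __name__ == "__main__":
    includes, libs = cuda_dirs_from_env()
    print(compiler_include_flags(includes), libs)
```

In a setup.py, the two lists would typically be passed to the extension as include_dirs and library_dirs, so that the CUDA include path shows up alongside the NCCL one in the compile command.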
Environment (please complete the following information):
OS: Ubuntu
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
CUDA and NCCL version: CUDA 11.0, NCCL 2.7.8
Framework (TF, PyTorch, MXNet): PyTorch 1.8