bytedance / byteps

A high performance and generic framework for distributed DNN training

Successfully installed BytePS but cannot import byteps.torch or byteps.tensorflow #428

Closed: hamidralmasi closed this issue 2 years ago

hamidralmasi commented 2 years ago

Describe the bug:
I have installed BytePS successfully. I specified the paths for NCCL and CUDA, and I have working TensorFlow and PyTorch installations on my system. I'm building BytePS from source, and the build output shows "Tensorflow extension is built successfully." and "PyTorch extension is built successfully". I can also import byteps in Python after installation. However, to start my distributed training workloads I need to import byteps.torch and byteps.tensorflow, and both imports fail with the following errors:

```
>>> import byteps.torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/halmas3/byteps/byteps/torch/__init__.py", line 24, in <module>
    from byteps.torch.ops import push_pull_async_inplace as byteps_push_pull
  File "/users/halmas3/byteps/byteps/torch/ops.py", line 29, in <module>
    from byteps.torch import c_lib
ImportError: cannot import name 'c_lib' from partially initialized module 'byteps.torch' (most likely due to a circular import) (/users/halmas3/byteps/byteps/torch/__init__.py)
```

```
>>> import byteps.tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/halmas3/byteps/byteps/tensorflow/__init__.py", line 27, in <module>
    from byteps.tensorflow.ops import broadcast, _push_pull
  File "/users/halmas3/byteps/byteps/tensorflow/ops.py", line 53, in <module>
    C_LIB = _load_library('c_lib' + get_ext_suffix())
  File "/users/halmas3/byteps/byteps/tensorflow/ops.py", line 49, in _load_library
    library = load_library.load_op_library(filename)
  File "/users/halmas3/.local/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /users/halmas3/byteps/byteps/tensorflow/c_lib.cpython-38-x86_64-linux-gnu.so: cannot open shared object file: No such file or directory
```

Environment:

How can I resolve the issue for each one of these cases? Thank you so much for your help!
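Both tracebacks point into /users/halmas3/byteps/byteps/..., i.e. the source checkout, and both fail on the compiled c_lib extension. The PyTorch message mentions a circular import, but it most likely has the same root cause: `from byteps.torch import c_lib` finds no c_lib in the package directory being imported. A quick way to check which copy of byteps Python resolves, and whether compiled c_lib files sit next to it, is a sketch like the one below. It assumes the built extensions live alongside the package's Python files (as the path in the TensorFlow error suggests); `package_dir` and `compiled_extensions` are hypothetical helper names, not part of BytePS.

```python
import glob
import importlib.util
import os
from typing import List, Optional


def package_dir(name: str) -> Optional[str]:
    """Directory that `import <name>` would load the top-level package from, or None."""
    # For a top-level name, find_spec locates the package without importing it.
    spec = importlib.util.find_spec(name)
    if spec is None or spec.submodule_search_locations is None:
        return None  # not found, or a plain module rather than a package
    return list(spec.submodule_search_locations)[0]


def compiled_extensions(pkg_dir: str, subpackage: str, stem: str = "c_lib") -> List[str]:
    """Compiled extension files named <stem>* (.so or .pyd) inside <pkg_dir>/<subpackage>."""
    candidates = glob.glob(os.path.join(pkg_dir, subpackage, stem + "*"))
    return sorted(p for p in candidates if p.endswith((".so", ".pyd")))


if __name__ == "__main__":
    where = package_dir("byteps")
    print("byteps resolves to:", where)
    if where is not None:
        for sub in ("torch", "tensorflow"):
            found = compiled_extensions(where, sub)
            print(f"  {sub}: {found or 'no c_lib extension found here'}")
```

If "byteps resolves to" shows the source checkout and no c_lib files are listed, Python is importing the uncompiled source tree rather than the installed package.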

hamidralmasi commented 2 years ago

I think I was inside the local byteps source folder when importing these, so Python did not pick up the installed byteps; instead it imported the byteps package from the source tree in the current directory, which does not contain the compiled c_lib extensions. Stepping out of that folder resolved the issue.
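This matches standard Python import behaviour: `sys.path` is searched in order and the first match wins. When Python is started interactively or with `-c`, the first `sys.path` entry is the current directory (for a script, it is the script's directory), so starting Python inside the checkout finds its `byteps/` subfolder before the installed copy. The sketch below reproduces the effect with two throwaway packages; `demo_pkg`, `make_pkg`, and `import_with_path` are hypothetical names used only for illustration.

```python
import importlib
import os
import sys
import tempfile


def make_pkg(root: str, name: str, marker: str) -> None:
    """Create <root>/<name>/__init__.py that records which copy was imported."""
    pkg = os.path.join(root, name)
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write(f"WHERE = {marker!r}\n")


def import_with_path(name: str, search_path: list) -> str:
    """Import `name` with `search_path` prepended to sys.path; return which copy won."""
    saved = list(sys.path)
    sys.modules.pop(name, None)  # force a fresh lookup instead of reusing a cached module
    try:
        sys.path[:0] = search_path
        importlib.invalidate_caches()
        return importlib.import_module(name).WHERE
    finally:
        sys.path[:] = saved
        sys.modules.pop(name, None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as source_tree, tempfile.TemporaryDirectory() as site_packages:
        # Stand-ins for a source checkout and an installed copy of the same package.
        make_pkg(source_tree, "demo_pkg", "source tree (no compiled c_lib)")
        make_pkg(site_packages, "demo_pkg", "installed copy")
        # Checkout directory first on the path, as when Python is started inside it:
        print(import_with_path("demo_pkg", [source_tree, site_packages]))  # prints: source tree (no compiled c_lib)
        # Checkout not on the path, as after stepping out of it:
        print(import_with_path("demo_pkg", [site_packages]))  # prints: installed copy
```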

ChaosPengs commented 1 year ago

Sorry to interrupt, but I ran into the same error. What does stepping out of the byteps folder mean? Does it mean that running the Python file from outside the byteps folder will make it work? I tried that, but I still get the same error.
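If the error persists when running from outside the checkout, the source tree may still be reaching `sys.path` some other way. Possible routes, not confirmed for this case, include the script itself living inside the checkout (its directory becomes `sys.path[0]`), the checkout being listed in `PYTHONPATH`, or a development install (`pip install -e .` or `python setup.py develop`) that adds the source directory to `sys.path`. A minimal check, assuming plain directory packages (zip/egg archives and custom import hooks are ignored), is to list every `sys.path` entry that holds a `byteps` package, in lookup order. `shadowing_candidates` is a hypothetical helper, not a BytePS API.

```python
import os
import sys
from typing import List


def shadowing_candidates(name: str, search_path: List[str]) -> List[str]:
    """Directories on `search_path` containing a package or module `name`, in lookup order.

    Only plain directories are checked. The first hit is the copy `import name`
    would normally use; later hits are shadowed by it.
    """
    hits = []
    for entry in search_path:
        base = os.path.abspath(entry or os.curdir)  # '' in sys.path means the current directory
        is_pkg = os.path.isfile(os.path.join(base, name, "__init__.py"))
        is_module = os.path.isfile(os.path.join(base, name + ".py"))
        if is_pkg or is_module:
            hits.append(base)
    return hits


if __name__ == "__main__":
    # Run with the same interpreter and working directory as the failing job.
    for i, d in enumerate(shadowing_candidates("byteps", sys.path)):
        print(("USED    " if i == 0 else "shadowed") + "  " + d)
```

If the line marked USED is the source checkout rather than the site-packages (or egg) directory, that checkout is still the copy being imported.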