Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

ImportError: /root/miniconda3/envs/bluefog/lib/python3.8/site-packages/bluefog/torch/mpi_lib.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv #110

Open yangxuanfei opened 2 years ago

yangxuanfei commented 2 years ago

Running the tutorial produces the error above. What is the cause?

BichengYing commented 2 years ago

Hi, can you post your environment settings? That error probably means at::Tensor::device cannot be found among the linked symbols. at is the ATen library inside PyTorch. So I guess it is either related to your PyTorch version, or the BlueFog build failed to link the PyTorch symbols. It would be helpful if you posted how you installed the BlueFog library.
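To show how the mangled name in the error maps to at::Tensor::device, here is a minimal sketch of a demangler. It handles only nested names with no arguments, which is all this symbol needs. The helper name `demangle_simple` is hypothetical; for anything more general, a real tool such as `c++filt` is the better choice.

```python
def demangle_simple(symbol: str) -> str:
    """Demangle a nested, no-argument Itanium C++ name (illustrative subset only)."""
    if not symbol.startswith("_ZN"):
        raise ValueError("expected a nested mangled name starting with _ZN")
    pos = 3
    # An optional 'K' marks a const member function.
    is_const = pos < len(symbol) and symbol[pos] == "K"
    if is_const:
        pos += 1
    parts = []
    # Each component is encoded as <decimal length><identifier>.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        name = symbol[end:end + length]
        if len(name) != length:
            raise ValueError("truncated mangled name")
        parts.append(name)
        pos = end + length
    # 'E' closes the nested name, 'v' means an empty (void) parameter list.
    if not parts or symbol[pos:] != "Ev":
        raise ValueError("this sketch only handles no-argument functions")
    return "::".join(parts) + "()" + (" const" if is_const else "")


if __name__ == "__main__":
    print(demangle_simple("_ZNK2at6Tensor6deviceEv"))
```

Run on the symbol from the error message, this yields `at::Tensor::device() const`, which is why the failure points at PyTorch's ATen library rather than at MPI.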

yangxuanfei commented 2 years ago

I use Anaconda. The Python version is 3.8 and the PyTorch version is 1.8. I downloaded the installation package from the official website and installed it locally. The installation package is torch-1.8.1+cu102-cp38-cp38-linux_x86_64.whl. Is the problem with this installation package? Does it have nothing to do with my OpenMPI?
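The details above can be gathered in one go when reporting an issue like this. The following is a minimal sketch that uses only the standard library; the function name `collect_environment` is hypothetical, and the distribution names `torch` and `bluefog` are assumed to match what pip installed.

```python
import platform
from importlib import metadata


def collect_environment(packages=("torch", "bluefog")):
    """Return the Python/platform details plus installed package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Reads the version recorded by pip/conda without importing the package,
            # so it works even when the import itself fails with an ImportError.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Because it never imports `bluefog` itself, the script still runs in an environment where the extension module fails to load.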

BichengYing commented 2 years ago
  1. This should not be related to OpenMPI: the failure is an unresolved symbol on the C++ side, where our backend depends on PyTorch.
  2. I don't know why it cannot find the symbol (this is the first time I have seen it), but I would suggest downgrading torch to 1.5.
  3. I found a similar issue (https://github.com/aim-uofa/AdelaiDet/issues/181). Try building BlueFog from the GitHub source; this page may help: https://github.com/Bluefog-Lib/bluefog/wiki/BlueFog-Development-Guide