ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Pytorch distributed training with MPI backend #453

Open 401qingkong opened 5 years ago

401qingkong commented 5 years ago

Hi, does ROCm PyTorch support distributed training with the MPI backend? Right now PyTorch doesn't work with MPI for me. The error is:

RuntimeError: CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support

What's the problem? Could you give me some advice? Thanks :)

iotamudelta commented 5 years ago

There is currently no support for distributed training with MPI as a backend.

For the Caffe2 backend you can already use gloo; for the PyTorch backend we are in the process of upstreaming RCCL enablement.
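
To see which backends a given build actually offers, something like the following may help. It is only a rough sketch: `available_backends` and `pick_backend` are hypothetical helper names, not part of PyTorch, and the preference order is an assumption. It also assumes that a ROCm build with RCCL enabled reports it under the `nccl` backend name, which may not hold for every build.

```python
import importlib.util


def available_backends():
    """Return {backend_name: bool} as reported by torch.distributed.

    Returns an empty dict if torch is not installed or was built
    without distributed support.
    """
    if importlib.util.find_spec("torch") is None:
        return {}
    import torch.distributed as dist

    if not dist.is_available():
        return {}
    return {
        # Assumption: on ROCm builds, RCCL shows up under the "nccl" name.
        "nccl": dist.is_nccl_available(),
        "gloo": dist.is_gloo_available(),
        "mpi": dist.is_mpi_available(),
    }


def pick_backend(available, use_gpu=True):
    """Pick the first usable backend name, or None if there is none.

    The preference order is an assumption made for this sketch. MPI is
    never chosen for GPU tensors, because an MPI library that is not
    CUDA-aware raises the RuntimeError quoted in this issue.
    """
    preference = ["nccl", "gloo"] if use_gpu else ["gloo", "mpi"]
    for name in preference:
        if available.get(name, False):
            return name
    return None


if __name__ == "__main__":
    found = available_backends()
    print("reported backends:", found)
    print("suggested backend for GPU tensors:", pick_backend(found, use_gpu=True))
```

If a backend name comes back, it can be passed as `backend=` to `torch.distributed.init_process_group`. If the result is `None` for GPU tensors, the build has no usable GPU-capable backend yet.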