Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Win-Ops on Docker has a memory issue (OpenMPI implementation). #3

Closed by BichengYing 4 years ago

BichengYing commented 4 years ago

Also, win-ops communication is not yet supported with Open MPI 4.0.

BichengYing commented 4 years ago

As a temporary workaround, we set the BLUEFOG_WIN_ON_CPU=1 flag so that Bluefog copies the GPU tensor to the CPU, communicates through the CPU, and copies the result back to the GPU once communication is done.
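
For readers unfamiliar with this kind of workaround, here is a minimal sketch of the staging pattern described above. It is not Bluefog's actual implementation: `communicate_via_cpu` and the `communicate` callable are hypothetical names used only for illustration, and the tensor is assumed to behave like a PyTorch tensor (`.device`, `.cpu()`, `.to()`).

```python
def communicate_via_cpu(tensor, communicate):
    """Stage a device tensor through host memory around a communication call.

    `communicate` is a hypothetical stand-in for a window operation: it takes
    a CPU tensor and returns a CPU tensor. This only illustrates the pattern;
    it is not how Bluefog implements BLUEFOG_WIN_ON_CPU internally.
    """
    original_device = tensor.device      # remember where the tensor lives
    cpu_tensor = tensor.cpu()            # copy to host memory
    result = communicate(cpu_tensor)     # communication touches host memory only
    return result.to(original_device)    # move the result back to the original device
```

The cost of this pattern is two extra copies per communication call, which is presumably why it was treated as temporary.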

BichengYing commented 4 years ago

Problem solved. Docker has to be run in privileged mode; just add the "--privileged" flag to the docker run command.
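
A small sketch of how the fix could be scripted, assuming the container is launched with `docker run`. The helper `docker_run_command`, the image name, and the training command are hypothetical placeholders; `--privileged` and `-e` are standard `docker run` options.

```python
import shlex


def docker_run_command(image, command, win_on_cpu=False):
    """Build a `docker run` argument list that enables privileged mode.

    `image` and `command` are placeholders supplied by the caller.
    `win_on_cpu` optionally passes the BLUEFOG_WIN_ON_CPU=1 workaround
    into the container as an environment variable.
    """
    args = ["docker", "run", "--privileged"]
    if win_on_cpu:
        args += ["-e", "BLUEFOG_WIN_ON_CPU=1"]
    args.append(image)
    args += list(command)
    return args


if __name__ == "__main__":
    # Hypothetical image name and entry point, for illustration only.
    cmd = docker_run_command("my-bluefog-image", ["python", "train.py"])
    print(shlex.join(cmd))
```

Passing the result to `subprocess.run(cmd, check=True)` would launch the container. Note that privileged mode grants the container broad access to the host, so it is best limited to trusted images.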