Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

NCCL issue with illegal memory access #31

Closed: hanbinhu closed this issue 3 years ago

hanbinhu commented 4 years ago

Command to run: bfrun -np 4 python examples/pytorch_benchmark.py --dist-optimizer=allreduce

Starting with commit aca0fec, this command makes nccl_controller.cc report "an illegal memory access was encountered", along with other kinds of errors.

BichengYing commented 4 years ago

It is related to the NCCL version, but I am not yet certain of the cause.
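Since the failure seems to depend on the NCCL version, it may help to report the exact version each machine loads. Below is a minimal sketch, not part of Bluefog: decode_nccl_version and loaded_nccl_version are hypothetical helper names, and the library name "libnccl.so.2" is an assumption about the local install. It relies on ncclGetVersion from the NCCL C API, which returns an integer code (major * 1000 + minor * 100 + patch up to 2.8.x, major * 10000 + minor * 100 + patch from 2.9 onward).

```python
import ctypes


def decode_nccl_version(code):
    """Split an integer NCCL version code into (major, minor, patch)."""
    if code < 10000:
        # Older scheme (up to 2.8.x): major * 1000 + minor * 100 + patch
        major, rest = divmod(code, 1000)
    else:
        # Newer scheme (2.9 onward): major * 10000 + minor * 100 + patch
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_nccl_version(lib_name="libnccl.so.2"):
    """Ask the NCCL shared library for its version.

    lib_name is an assumption; adjust it to match the local install.
    """
    lib = ctypes.CDLL(lib_name)
    version = ctypes.c_int()
    # ncclGetVersion(int*) returns an ncclResult_t; 0 means ncclSuccess.
    status = lib.ncclGetVersion(ctypes.byref(version))
    if status != 0:
        raise RuntimeError("ncclGetVersion failed with status %d" % status)
    return decode_nccl_version(version.value)


if __name__ == "__main__":
    # Decoding works without NCCL installed.
    print(decode_nccl_version(2708))
    print(decode_nccl_version(21005))
```

Calling loaded_nccl_version() on each node and posting the result would show whether the failing and working setups actually differ in NCCL version. Note that the library found by the dynamic loader may not be the one Bluefog was built against.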

Bluefog-Lib commented 4 years ago

This may be related to issue #44.

BichengYing commented 3 years ago

This should now be completely resolved.