Open thvasilo opened 1 month ago
@thvasilo Does NCCL + num_trainers > 1 work well for DistDGL (without GraphBolt)? I think this is not related to DistGB; it seems to be incomplete support for NCCL.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.
This is a confirmed issue. The workaround is to always set num_trainers to 1.
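For reference, a minimal sketch of applying the workaround with DGL's standard distributed launcher (tools/launch.py); all paths, the ip_config/part_config files, and the training script name are hypothetical placeholders, not values from this issue:

```shell
# Workaround sketch: pin one trainer per machine (num_trainers=1)
# to avoid the GraphBolt + multi-trainer failure.
# All paths and file names below are placeholders.
python3 tools/launch.py \
    --workspace /path/to/workspace \
    --num_trainers 1 \
    --num_samplers 1 \
    --num_servers 1 \
    --part_config data/graph.json \
    --ip_config ip_config.txt \
    "python3 train_dist.py"
```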
🐛 Bug
I've observed an error when trying to use GraphBolt with --num-trainers > 1. In this case I'm using DistGB through GraphStorm, so I'm not sure whether GSF or GB is the root cause. It's hard to make out from the unordered stacktrace, but I'm listing it here:

To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
How installed (conda, pip, source): pip

Additional context