dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

NCCL error when trying to run GraphBolt jobs with >1 trainer per worker #7426

Open thvasilo opened 1 month ago

thvasilo commented 1 month ago

🐛 Bug

I've observed an error when trying to use GraphBolt with --num-trainers > 1. In this case I'm using DistGB through GraphStorm, so I'm not sure whether the root cause is in GSF or in GB. The stack traces from the individual trainer processes were interleaved and hard to read; a reassembled representative traceback is below:

    Traceback (most recent call last):
      File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 190, in <module>
        main(gs_args)
      File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 70, in main
        train_data = GSgnnNodeTrainData(config.graph_name,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 914, in __init__
        super(GSgnnNodeTrainData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 800, in __init__
        super(GSgnnNodeData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 219, in __init__
        self.prepare_data(self._g)
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 959, in prepare_data
        if dist_sum(len(val_idx)) > 0:
      File "/graphstorm/python/graphstorm/dataloading/utils.py", line 80, in dist_sum
        dist.all_reduce(size, dist.ReduceOp.SUM)
      File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
        return func(*args, **kwargs)
      File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
        work = group.allreduce([tensor], opts)
    torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
    ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
    Last error:
    Duplicate GPU detected : rank 10 and rank 8 both on CUDA device 160

Other trainer ranks fail the same way (e.g. "Duplicate GPU detected : rank 23 and rank 16 both on CUDA device 160") and the clients then exit:

    Client[7] in group[0] is exiting...
    Client[8] in group[0] is exiting...
    Client[9] in group[0] is exiting...
    Client[15] in group[0] is exiting...
    Client[30] in group[0] is exiting...
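
For reference, the failure mode the trace points at (multiple ranks of one NCCL process group pinned to the same CUDA device) can be reproduced outside GraphStorm/DGL with a minimal torch.distributed script. This is a hypothetical sketch, not the actual job; it deliberately places both ranks on device 0:

    # Hypothetical repro sketch: two NCCL ranks on the same CUDA device.
    # Not GraphStorm/DGL code; device 0 stands in for the shared device.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(0)  # both ranks pick the same GPU
        t = torch.ones(1, device="cuda")
        # Expected to fail with ncclInvalidUsage / "Duplicate GPU detected"
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)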

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Environment

Additional context

Rhett-Ying commented 1 month ago

@thvasilo Does NCCL + num_trainers > 1 work well for DistDGL (without GraphBolt)? I don't think this is related to DistGB; it looks like incomplete NCCL support.
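
For triage: "Duplicate GPU detected" means several trainer ranks on one machine selected the same CUDA device before the NCCL communicator was created. A hedged sketch of the per-trainer device mapping NCCL expects (the LOCAL_RANK variable and the modulo mapping are assumptions, not GraphStorm/DGL code):

    # Sketch of the expected mapping: one distinct GPU per trainer process.
    # LOCAL_RANK is an assumption; the real launcher may pass the trainer index differently.
    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # trainer index on this machine
    dev_id = local_rank % torch.cuda.device_count()       # distinct device per trainer
    torch.cuda.set_device(dev_id)                         # must happen before NCCL collectives

    dist.init_process_group(backend="nccl")               # RANK/WORLD_SIZE/MASTER_* from env
    t = torch.zeros(1, device=f"cuda:{dev_id}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)              # works when devices are distinct
    dist.destroy_process_group()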

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.

jermainewang commented 1 week ago

This is a confirmed issue. The workaround is to always set num_trainers to 1.