NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through Rate (CTR) estimation training
Apache License 2.0
937 stars · 200 forks

[Question] Multi-node training encounters Runtime error: unhandled system error ncclGroupEnd() #404

Closed · heroes999 closed this issue 1 year ago

heroes999 commented 1 year ago

I'm trying to run a simple 2-node wide & deep training, but the program fails with a runtime error that I don't understand:

====================================================Model Init=====================================================
[HCTR][14:56:35.473][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][14:56:35.476][INFO][RK0][main]: Global seed is 59354043
[HCTR][14:56:35.477][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
Traceback (most recent call last):
  File "wdl.py", line 31, in <module>
    model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: unhandled system error
        ncclGroupEnd() at ResourceManagerCore(/hugectr/HugeCTR/src/resource_managers/resource_manager_core.cpp:175)

My environment setup:

  1. 2 nodes: one with RTX 4090 and one with RTX 3090
  2. hugectr docker v22.11
  3. docker run options: --net=host, -p 2222:22 so that the container on host A can ssh to the container on host B successfully (.ssh/config is set to default to port 2222)
  4. mpi command: mpirun --allow-run-as-root -np 2 --hostfile hosts /hugectr_dataset/archive_wdl_model/run.sh; run.sh calls wdl.py

PS: in a single-node environment, the wide & deep model can be trained successfully.
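For reference, the port setup described in item 3 would typically be done with an entry like the following in `~/.ssh/config` on each host. This is a sketch of the configuration the reporter describes, not their actual file; `hostB` is a placeholder for the other node's hostname or IP:

```
# ~/.ssh/config -- make ssh to the peer node use the container's mapped port,
# so that "ssh hostB" lands inside the peer's container rather than the host.
Host hostB
    Port 2222
    User root
```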

python wdl.py

import hugectr
from mpi4py import MPI

"""
construct model with fixed seed and batchsize=1 so that this model is reproducible
last iter's loss/predict value is attached after the comments of the code
"""

solver = hugectr.CreateSolver(
    max_eval_batches=2,
    batchsize_eval=2,
    batchsize=2,
    lr=0.001,
    vvgpu=[[0],[0]],
    repeat_dataset=True,
    seed = 59354043)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Norm,
    source=["./criteo_data_new/file_list.txt"],
    eval_source="./criteo_data_new/file_list_test.txt",
    check_type=hugectr.Check_t.Sum,
    num_workers = 1,
)
optimizer = hugectr.CreateOptimizer(
    optimizer_type=hugectr.Optimizer_t.Adam,
    update_type=hugectr.Update_t.Global,
    beta1=0.9,
    beta2=0.999,
    epsilon=0.0000001,
)
model = hugectr.Model(solver, reader, optimizer)
model.add(
    hugectr.Input(
        label_dim=1,
        label_name="label",
        dense_dim=13,
        dense_name="dense",
        data_reader_sparse_param_array=[
            hugectr.DataReaderSparseParam("wide_data", 2, True, 1),
            hugectr.DataReaderSparseParam("deep_data", 2, False, 26),
        ],
    )
)

.... (omit irrelevant parts)

Could anybody please help have a look at this issue? Thanks.

kanghui0204 commented 1 year ago

Hi @heroes999, have you run nccl-tests in your environment? Do they pass?

heroes999 commented 1 year ago

> Hi @heroes999, have you run nccl-tests in your environment? Do they pass?

Ok, I'll give it a shot today

heroes999 commented 1 year ago

@kanghui0204 No, nccl-tests does not pass in my environment. Any suggestions on how to move forward? PS: it still seems to terminate around ncclGroupEnd(). Do I need to set NCCL_HOME or some other environment variables?

root@xxx:~/project/nccl-tests# mpirun --allow-run-as-root -np 2 --hostfile hosts /root/project/nccl-tests/build/all_reduce_perf -b 8 -e 1M -f 2 -g 1 -t 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  20597 on devops-System-Product-Name device  0 [0x09] NVIDIA GeForce RTX 4070 Ti
#  Rank  1 Group  0 Pid   5713 on user-System-Product-Name device  0 [0x09] NVIDIA GeForce RTX 3090
user-System-Product-Name: Test NCCL failure common.cu:958 'internal error'
 .. user-System-Product-Name pid 5713: Test failure common.cu:842
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[7756,1],1]
  Exit code:    3
--------------------------------------------------------------------------
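When nccl-tests fails like this, NCCL's own debug logging usually points at the culprit (transport selection, interface mismatch, peer unreachable, etc.). Below is a sketch of re-running the same test with logging enabled; `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables, but `eth0` is a placeholder for whichever NIC actually carries inter-node traffic on these hosts:

```shell
# Re-run the failing all_reduce_perf with NCCL debug output enabled.
# Under Open MPI, -x forwards an environment variable to all ranks.
mpirun --allow-run-as-root -np 2 --hostfile hosts \
    -x NCCL_DEBUG=INFO \
    -x NCCL_SOCKET_IFNAME=eth0 \
    /root/project/nccl-tests/build/all_reduce_perf -b 8 -e 1M -f 2 -g 1 -t 1
```

The INFO-level log shows which interface and transport each rank selects, which is often enough to spot a misconfigured network.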
heroes999 commented 1 year ago

@kanghui0204 Are there other, easier ways to interconnect two HugeCTR containers on two different nodes? I suspect the underlying cause is that I use docker run --net=host to share the network with the physical host, so that one container can reach the other via the host's IP address, but I'm not sure.
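One thing worth checking in this setup: NCCL opens its own TCP connections between ranks, separate from the ssh channel MPI uses to launch them, so plain port-to-port reachability between the two containers matters. A quick sanity check (the IP address and port below are placeholders):

```shell
# On host B's container: listen on an arbitrary TCP port.
nc -l 12345

# On host A's container: try to connect to that port on host B.
# If this fails, NCCL's socket transport will fail too.
nc -zv 192.168.1.20 12345
```

If the connection is refused or times out, the problem is in the Docker/host networking rather than in NCCL or HugeCTR.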

kanghui0204 commented 1 year ago

@heroes999, I think you can try this guide, but you still need to figure out why NCCL breaks down. As for your question, I can't tell what the problem is from your error log alone. SSH access between the Docker containers is a necessary condition for using NCCL, but it seems your nodes and environment have some other problem.

RayWang96 commented 1 year ago

@heroes999, is this problem solved?

heroes999 commented 1 year ago

@RayWang96 Not yet. Are there other, easier ways to interconnect two HugeCTR containers on two different nodes? I bet the problem is related to my network config (the container and the host share the same IP but use different SSH ports: 22 on the host, 2222 in the container).

kanghui0204 commented 1 year ago

@heroes999 I think you'd better open an issue in the NCCL repo and ask them how to solve the problem. FYI @RayWang96

heroes999 commented 1 year ago

@kanghui0204 OK, I'll turn to the NCCL repo first. Closing this.