Closed heroes999 closed 1 year ago
Hi @heroes999, have you run nccl-tests in your environment? Do they pass?
Ok, I'll give it a shot today
@kanghui0204 No, nccl-tests does not pass in my environment. Any suggestions for moving forward? PS: it still seems to terminate in ncclGroupEnd(). Do I need to set NCCL_HOME or some other environment variables?
root@xxx:~/project/nccl-tests# mpirun --allow-run-as-root -np 2 --hostfile hosts /root/project/nccl-tests/build/all_reduce_perf -b 8 -e 1M -f 2 -g 1 -t 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 20597 on devops-System-Product-Name device 0 [0x09] NVIDIA GeForce RTX 4070 Ti
# Rank 1 Group 0 Pid 5713 on user-System-Product-Name device 0 [0x09] NVIDIA GeForce RTX 3090
user-System-Product-Name: Test NCCL failure common.cu:958 'internal error'
.. user-System-Product-Name pid 5713: Test failure common.cu:842
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[7756,1],1]
Exit code: 3
--------------------------------------------------------------------------
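The 'internal error' in the log above gives little to go on. One way to get more detail is to rerun the same test with NCCL's debug logging turned on. The sketch below is not a confirmed fix. It assumes Open MPI (whose `-x` option forwards an environment variable to every rank) and reuses the binary path and arguments from the command above. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are NCCL environment variables, but the interface name `eth0` is a placeholder and may not match either node.

```shell
# Sketch only: rerun the failing nccl-tests command with NCCL debug logging.
# BIN and the test arguments are copied from the command in this thread.
BIN=/root/project/nccl-tests/build/all_reduce_perf

# Hypothetical interface name; check `ip addr` on each node for the NIC
# that carries traffic between the two hosts.
IFACE=eth0

# -x NAME=VALUE asks Open MPI to export the variable to all launched ranks.
CMD="mpirun --allow-run-as-root -np 2 --hostfile hosts -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=$IFACE $BIN -b 8 -e 1M -f 2 -g 1 -t 1"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

With `NCCL_DEBUG=INFO`, each rank prints which interfaces and transports it selected, which is usually the most useful thing to attach when asking for help with an error like this.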
@kanghui0204 Are there other, easier ways to interconnect two HugeCTR Docker containers on two different nodes? I guess the underlying cause is that I use docker run --net=host to share the network with the physical host, so that one container can reach the other via its host IP address, but I'm not sure.
@heroes999, I think you can try this guide, but you still need to figure out why NCCL breaks down. As for your question, I can't tell what the problem is from your error log alone. SSH access between the Docker containers is a necessary condition for using NCCL, but it seems your nodes and environment have other problems.
@heroes999, is this problem solved?
@RayWang96 Not yet. Are there any other, easier ways to interconnect two HugeCTR Docker containers on two different nodes? I bet the problem is related to my network config (the container and the host share the same IP but use different SSH ports: the host uses 22, the container uses 2222).
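For the port setup described above (container sshd on 2222, host sshd on 22), a minimal sketch of pointing the launcher at the container's port follows. It assumes Open MPI with its default rsh/ssh launcher, where the `plm_rsh_args` MCA parameter passes extra arguments to `ssh`. Note that the earlier log shows both ranks starting, so the launch itself may already work and the NCCL failure may be unrelated to the SSH port.

```shell
# Sketch only: have Open MPI's ssh launcher use the container's sshd port.
# SSH_PORT is taken from the thread (the container listens on 2222).
SSH_PORT=2222
BIN=/root/project/nccl-tests/build/all_reduce_perf

# plm_rsh_args appends arguments to the ssh command Open MPI uses to
# start remote ranks; here it adds "-p 2222".
CMD="mpirun --allow-run-as-root -np 2 --hostfile hosts --mca plm_rsh_args \"-p $SSH_PORT\" $BIN -b 8 -e 1M -f 2 -g 1 -t 1"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

An alternative with the same effect is a `Port 2222` entry for the peer host in `~/.ssh/config` inside each container.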
@heroes999 I think you'd better open an issue in the NCCL repo and ask them how to solve the problem. FYI @RayWang96
@kanghui0204 OK, I'll turn to the NCCL repo first. Closing this.
I'm trying to run a simple 2-node wide & deep training job, but the program reports runtime errors that I don't understand:
My environment setup:
PS: in a single-node environment, a wide & deep model trains successfully.
Could anybody please take a look at this issue? Thanks.