NVIDIA-Merlin / HugeCTR

HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training
Apache License 2.0

[Question] Configuration issues with MLCommons benchmarking #421

Open raghavendrachari08 opened 9 months ago

raghavendrachari08 commented 9 months ago

Hi, I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

On a single node I am able to run the benchmark, but when I run it on multiple nodes (say, 2 nodes) I hit the issue shown below. Could you please help me resolve it?

[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
  File "/dev/shm/data/hugectl/train.py", line 344, in <module>
    model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
    MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)

shijieliu commented 9 months ago

Hi @raghavendrachari08

[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1

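It looks to me like the error is related to the multi-node MPI setup rather than to HugeCTR itself: UCX fails to create an endpoint to rank 1, so the first MPI_Bcast fails. Could you check your multi-node MPI configuration by running some demo code?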
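
A minimal sketch of what such demo code could look like, assuming mpi4py is installed on every node (mpi4py is not mentioned in this thread, so treat it as an assumption). It repeats the pattern that fails in the traceback, a broadcast of a single seed value from rank 0, with no HugeCTR code involved. The function name check_broadcast and the seed value are illustrative only.

```python
import socket


def check_broadcast(comm, seed=1234):
    """Broadcast `seed` from rank 0 and report whether every rank received it."""
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Only rank 0 holds the value before the broadcast,
    # mirroring the MPI_Bcast(&seed, ...) call in the traceback.
    value = seed if rank == 0 else None
    value = comm.bcast(value, root=0)

    # Collect (rank, hostname, received value) on rank 0 so that one
    # process can print which nodes actually took part.
    report = comm.gather((rank, socket.gethostname(), value), root=0)

    if rank == 0:
        for r, host, v in report:
            print(f"rank {r} on {host}: received {v}")
        return len(report) == size and all(v == seed for _, _, v in report)
    return value == seed


def main():
    # Assumption: mpi4py is available in the environment used for training.
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; cannot run the MPI check")
        return
    ok = check_broadcast(MPI.COMM_WORLD)
    if MPI.COMM_WORLD.Get_rank() == 0:
        print("broadcast OK" if ok else "broadcast FAILED")


if __name__ == "__main__":
    main()
```

If this is saved as, say, check_mpi.py (a hypothetical file name), it could be launched across the two nodes with the same launcher and options used for the benchmark, for example `mpirun -np 2 -H <node1>,<node2> python check_mpi.py` with Open MPI. If this small test also ends in a UCX endpoint timeout, the problem lies in the MPI/network configuration rather than in the training script.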