Hello,
I was following the Step-by-Step tutorial and built from the source code.
Single-machine training with DMLC_NUM_WORKER=1 and multiple GPUs runs fine (up to 8 GPUs), but when I tried to run distributed training by changing only
DMLC_NUM_WORKER=1
to
DMLC_NUM_WORKER=2
(all processes launched on a single node with 8 GPUs), it fails with the following error:
src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
munmap_chunk(): invalid pointer
Aborted (core dumped)
I launched the processes on the same node in the following order:
Worker -> Server -> Worker -> Scheduler
Bash Script for launching Worker-0 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Worker-1 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Server is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
Bash Script for launching Scheduler is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
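For reference, the four per-role scripts above can be condensed into a single-node launcher sketch. This is only a sketch of my setup, not a tested script: it assumes bpslaunch is on PATH, keeps A.B.C.D as a placeholder for the scheduler's RDMA IP, and (unlike my original scripts, which set NVIDIA_VISIBLE_DEVICES=0 for both workers) gives each worker a distinct GPU, which I understand is the usual arrangement when co-locating workers.

```shell
#!/bin/bash
# Sketch only: single-node BytePS launch with 2 workers, 1 server, 1 scheduler.
# A.B.C.D remains a placeholder for the scheduler's RDMA interface IP.
export BYTEPS_LOG_LEVEL=INFO BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0 DMLC_PS_ROOT_URI=A.B.C.D DMLC_PS_ROOT_PORT=1234

# Scheduler and server take no training command.
DMLC_ROLE=scheduler bpslaunch &
DMLC_ROLE=server    bpslaunch &

# Each worker gets its own ID and (here, as an assumption) its own GPU.
for id in 0 1; do
  DMLC_ROLE=worker DMLC_WORKER_ID=$id NVIDIA_VISIBLE_DEVICES=$id \
    bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py \
      --model resnet50 --num-iters 100 &
done
wait
```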
Environment:
Can you please help me solve this error? Thank you.