Hello,
I was following the Step-by-Step tutorial and built from the source code.
Single-machine training with DMLC_NUM_WORKER=1 and multiple GPUs runs fine (up to 8 GPUs), but when I tried to run distributed training by changing only
DMLC_NUM_WORKER=1
to
DMLC_NUM_WORKER=2
(all processes launched on a single node with 8 GPUs), it fails with the following error:
src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
munmap_chunk(): invalid pointer
Aborted (core dumped)
I launched the processes on the same node in the following order:
Worker -> Server -> Worker -> Scheduler
Bash Script for launching Worker-0 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Worker-1 is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
Bash Script for launching Server is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
Bash Script for launching Scheduler is:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
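For reference, the four per-role scripts above can be condensed into a single-node launcher sketch. This is only a sketch of my setup, not a tested script: it assumes bpslaunch is on PATH, keeps A.B.C.D as a placeholder for the scheduler's RDMA IP, and (unlike my original scripts, which set NVIDIA_VISIBLE_DEVICES=0 for both workers) gives each worker a distinct GPU, which I understand is the usual arrangement when co-locating workers.

```shell
#!/bin/bash
# Sketch only: single-node BytePS launch with 2 workers, 1 server, 1 scheduler.
# A.B.C.D remains a placeholder for the scheduler's RDMA interface IP.
export BYTEPS_LOG_LEVEL=INFO BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0 DMLC_PS_ROOT_URI=A.B.C.D DMLC_PS_ROOT_PORT=1234

# Scheduler and server take no training command.
DMLC_ROLE=scheduler bpslaunch &
DMLC_ROLE=server    bpslaunch &

# Each worker gets its own ID and (here, as an assumption) its own GPU.
for id in 0 1; do
  DMLC_ROLE=worker DMLC_WORKER_ID=$id NVIDIA_VISIBLE_DEVICES=$id \
    bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py \
      --model resnet50 --num-iters 100 &
done
wait
```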
Environment:
Can you please help me solve this error? Thank you.