bytedance / byteps

A high performance and generic framework for distributed DNN training

Giving the error munmap_chunk(): invalid pointer in BytePS when DMLC_NUM_WORKER changed from 1 to 2 #398

Open udaykiran009 opened 3 years ago

udaykiran009 commented 3 years ago

Hello, I was following the Step-by-Step tutorial and built BytePS from source.

Single-machine training with DMLC_NUM_WORKER=1 and multiple GPUs runs fine (up to 8 GPUs). The failure appears when I try to run distributed training by changing only

DMLC_NUM_WORKER=1 to DMLC_NUM_WORKER=2

All processes are launched on a single node, which has 8 GPUs.

It is giving the following error:

src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
munmap_chunk(): invalid pointer
Aborted (core dumped)

I launched the processes on the same node in the following order: Worker -> Server -> Worker -> Scheduler

Bash Script for launching Worker-0 is:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2 # if this number is 1, there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100

Bash Script for launching Worker-1 is:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2 # if this number is 1, there is no error
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100

Bash Script for launching Server is:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 # if this number is 1, there is no error
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch

Bash Script for launching Scheduler is:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 # if this number is 1, there is no error
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch
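
The four scripts above share most of their settings and differ only in DMLC_ROLE, DMLC_WORKER_ID and NVIDIA_VISIBLE_DEVICES. As a rough sketch of that structure, the common part could be factored into one helper. The function name set_byteps_env is hypothetical and not part of BytePS; every value is copied from the scripts above, and A.B.C.D remains a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: set_byteps_env is a hypothetical helper, not a BytePS command.
# It exports the settings shared by all four scripts in this report and then
# the role-specific ones.
set_byteps_env() {
  local role="$1"           # worker | server | scheduler
  local worker_id="${2:-}"  # only used when role is "worker"

  # Settings common to every role (values taken from the report).
  export BYTEPS_LOG_LEVEL=INFO
  export BYTEPS_ENABLE_IPC=1
  export DMLC_ENABLE_RDMA=ibverbs
  export DMLC_NUM_WORKER=2
  export DMLC_NUM_SERVER=1
  export DMLC_INTERFACE=ib0
  # A.B.C.D is the placeholder used in the report, not a real address.
  export DMLC_PS_ROOT_URI="${DMLC_PS_ROOT_URI:-A.B.C.D}"
  export DMLC_PS_ROOT_PORT=1234
  export DMLC_ROLE="$role"

  # Worker-only settings. Both worker scripts in the report use device 0.
  if [ "$role" = "worker" ]; then
    export DMLC_WORKER_ID="$worker_id"
    export NVIDIA_VISIBLE_DEVICES=0
  fi
}
```

With this helper sourced, Worker-1 would correspond to running "set_byteps_env worker 1" followed by the bpslaunch line from the Worker-1 script, and the server to "set_byteps_env server" followed by bpslaunch.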


Can you please help me solve this error? Thank you.

ymjiang commented 3 years ago

Can you use gdb and paste the backtrace here?
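
One possible way to collect such a backtrace is to run the aborting process under gdb in batch mode. The sketch below assumes gdb is installed on the node. The wrapper name run_under_gdb and the DRY_RUN switch are hypothetical and exist only for illustration. Which process to wrap depends on which role aborts and on how bpslaunch spawns it, neither of which is established in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: run_under_gdb is a hypothetical wrapper, not part of BytePS.
# It runs the given command under gdb in batch mode. When the program stops
# (for example on the SIGABRT raised after "munmap_chunk(): invalid pointer"),
# gdb prints a backtrace of every thread and exits.
run_under_gdb() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Preview only; echo drops the quotes around the multi-word -ex argument.
    echo gdb -batch -ex run -ex "thread apply all bt" --args "$@"
  else
    gdb -batch -ex run -ex "thread apply all bt" --args "$@"
  fi
}
```

After exporting the same environment variables as in the failing script, the call would look like "run_under_gdb <command that aborts>"; setting DRY_RUN=1 first prints the gdb command line without running it.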