bytedance / byteps

A high performance and generic framework for distributed DNN training

Stuck in bps.init() #419

Closed Fangjin98 closed 2 years ago

Fangjin98 commented 2 years ago

I have 2 machines, each equipped with a GPU. I tried to run single-machine PyTorch training on each machine separately.

However, the program runs and trains correctly on machine 1 but gets stuck on machine 2.

The screenshot of machine 2 is as follows: [screenshot attached]

The script I run is:

export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=1
export DMLC_ROLE=worker

bpslaunch python3 ../../benchmark_byteps.py --model AlexNet --num-iters 100
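
For reference, the relevant part of the script roughly follows the standard BytePS PyTorch setup, like the simplified sketch below (this is not the full benchmark code); bps.init() is the first BytePS call, which is where machine 2 gets stuck:

import torch
import byteps.torch as bps

bps.init()                               # blocks here until the BytePS context is up
torch.cuda.set_device(bps.local_rank())  # pin this process to its GPU

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer and synchronize the initial state from rank 0.
optimizer = bps.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)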

When I press Ctrl+C on machine 2, the screenshot is: [screenshot attached]

P.S. Machine 2 prints

Traceback (most recent call last):
  File "/home/sdn/fj/byteps_ddp/scripts/run_ddp_tcp/../../basic_byteps.py", line 45, in <module>
    bps.init()
  File "/home/sdn/anaconda3/envs/fj/lib/python3.9/site-packages/byteps/common/__init__.py", line 63, in init
    return self.C_LIB_CTYPES.byteps_lazy_init()
KeyboardInterrupt

a while after I shut down the training job.

I checked the conda list on both machines; the versions of torch and byteps are the same on both.

Besides, the versions of CUDA, cuDNN and NCCL are also the same, as shown in the attached screenshot.

Fangjin98 commented 2 years ago

The full traceback is as follows:

(fj) sdn@server08:~/fj/byteps_ddp/scripts/run_ddp_tcp$ ./worker0.sh 
BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-11,24-35 python3 ../../basic_byteps.py --model AlexNet --num-iters 100

Traceback (most recent call last):
  File "../../basic_byteps.py", line 45, in <module>
    bps.init()
  File "/home/sdn/anaconda3/envs/fj/lib/python3.8/site-packages/byteps/common/__init__.py", line 63, in init
    return self.C_LIB_CTYPES.byteps_lazy_init()
KeyboardInterrupt
Traceback (most recent call last):
  File "../../basic_byteps.py", line 72, in <module>
    bps.broadcast_optimizer_state(optimizer, root_rank=0)
  File "/home/sdn/anaconda3/envs/fj/lib/python3.8/site-packages/byteps/torch/__init__.py", line 398, in broadcast_optimizer_state
    p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
Traceback (most recent call last):
  File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 253, in <module>
    launch_bps()
  File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 239, in launch_bps
    t[i].join()
  File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 34, in join
    raise self.exc
  File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 27, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 192, in worker
    subprocess.check_call(command, env=my_env,
  File "/home/sdn/anaconda3/envs/fj/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-11,24-35 python3 ../../basic_byteps.py --model AlexNet --num-iters 100' returned non-zero exit status 1.
Shangwei-Li commented 2 years ago

I encountered a similar problem where the init process waits indefinitely for threads to join, with CUDA 11.0 and CUDA 11.1 on an Ampere card. Would you please share how you solved it?

Fangjin98 commented 2 years ago

It's been a long time and I can't remember the specific cause of this error, but here is what I did:

  1. Set the number of PSs equal to the number of workers. (I'm not sure whether this is actually required.)
  2. Explicitly set the NIC of each worker and PS via export DMLC_INTERFACE=<your NIC name>. In my experimental environment each machine has 4 NICs, and by default BytePS chooses eno1 for training. I had set the NIC for the PSs but not for the workers, so they were communicating on different network segments, which cannot work.
  3. Explicitly set the scheduler address for the workers and PSs via export DMLC_PS_ROOT_URI=<IP of your scheduler> and export DMLC_PS_ROOT_PORT=<port of your scheduler>.

Besides, you can set the log level to DEBUG or INFO to show more error information. Good luck.
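
For concreteness, here is a sketch of what the per-machine environment looks like with these settings (the NIC name, IP and port below are placeholders for your own cluster; BYTEPS_LOG_LEVEL and PS_VERBOSE are the usual knobs for more verbose output):

export DMLC_NUM_WORKER=2           # 1. as many servers (PSs) as workers
export DMLC_NUM_SERVER=2
export DMLC_ROLE=worker            # worker / server / scheduler, per process
export DMLC_WORKER_ID=0            # only needed on worker machines

export DMLC_INTERFACE=eth1         # 2. pin the NIC on every worker AND every PS
export DMLC_PS_ROOT_URI=10.0.0.1   # 3. scheduler IP, identical on all machines
export DMLC_PS_ROOT_PORT=1234      #    scheduler port, identical on all machines

export BYTEPS_LOG_LEVEL=INFO       # or DEBUG, for more BytePS log output
export PS_VERBOSE=1                # ps-lite connection logging, useful for rendezvous issues

bpslaunch python3 ../../benchmark_byteps.py --model AlexNet --num-iters 100

As far as I remember, the scheduler and server machines use the same DMLC_* variables with the corresponding DMLC_ROLE and just run bpslaunch without a training command.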

Shangwei-Li commented 2 years ago

Thank you for sharing. Today I rebuilt my CUDA 11.0 docker image and ran that single-machine training job again, and it runs correctly. Besides, my system is running out of disk space, so I am not going to rebuild a CUDA 11.1 image soon to reproduce that problem. Anyway, thanks for your time!

heguangxin commented 1 year ago

So you successfully ran BytePS with CUDA 11? This has troubled me for a long time. If convenient, I hope you can briefly introduce your CUDA 11 docker image! Thanks a lot!

Shangwei-Li commented 1 year ago

Unfortunately it's been nearly a year and I can't find that Dockerfile. The main change that still lingers in my memory: I modified the versions of libcudnn, libnccl2 and libnccl-dev specified in the original Dockerfile to the ones built for cuda-11.0, according to the apt repository. Hope it helps, good luck!
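
Roughly, the change was of this kind (a sketch from memory; the package names and version strings below are placeholders, so check what your apt repository actually provides, e.g. with apt-cache madison libcudnn8 libnccl2):

# Pin cuDNN/NCCL to builds published for CUDA 11.0 instead of the versions in the original Dockerfile
apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages \
    libcudnn8=8.0.5.39-1+cuda11.0 libcudnn8-dev=8.0.5.39-1+cuda11.0 \
    libnccl2=2.8.4-1+cuda11.0 libnccl-dev=2.8.4-1+cuda11.0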

heguangxin commented 1 year ago

Thank you for helping, I'll try this. My machine is equipped with a 3080 Ti, which only supports CUDA 11 and later. Actually I want to use ByteScheduler instead of BytePS.
