The full traceback is as follows:
(fj) sdn@server08:~/fj/byteps_ddp/scripts/run_ddp_tcp$ ./worker0.sh
BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-11,24-35 python3 ../../basic_byteps.py --model AlexNet --num-iters 100
Traceback (most recent call last):
File "../../basic_byteps.py", line 45, in <module>
bps.init()
File "/home/sdn/anaconda3/envs/fj/lib/python3.8/site-packages/byteps/common/__init__.py", line 63, in init
return self.C_LIB_CTYPES.byteps_lazy_init()
KeyboardInterrupt
Traceback (most recent call last):
File "../../basic_byteps.py", line 72, in <module>
bps.broadcast_optimizer_state(optimizer, root_rank=0)
File "/home/sdn/anaconda3/envs/fj/lib/python3.8/site-packages/byteps/torch/__init__.py", line 398, in broadcast_optimizer_state
p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
Traceback (most recent call last):
File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 253, in <module>
launch_bps()
File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 239, in launch_bps
t[i].join()
File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 34, in join
raise self.exc
File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/home/sdn/anaconda3/envs/fj/bin/bpslaunch", line 192, in worker
subprocess.check_call(command, env=my_env,
File "/home/sdn/anaconda3/envs/fj/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-11,24-35 python3 ../../basic_byteps.py --model AlexNet --num-iters 100' returned non-zero exit status 1.
I encountered a similar problem where the init process waits for threads to join indefinitely with CUDA 11.0 and CUDA 11.1 on an Ampere card. Would you please share how you solved it?
It's been a long time, so I can't remember the specific cause of this error. But here is what I did:
- export DMLC_INTERFACE=your NIC name. In my experimental environment, each machine has 4 NICs. By default, BytePS chooses eno1 for training. However, I had set the NIC for the PSs but not for the workers, so they were communicating on different network segments, which won't work.
- export DMLC_PS_ROOT_URL=IP of your scheduler and export DMLC_PS_ROOT_PORT=port of your scheduler.
- Besides, you can set the log level to DEBUG or INFO to show more error information. Good luck.
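For concreteness, here is a minimal sketch of the per-node environment setup described above. The NIC name, scheduler IP and port are placeholders for your own cluster, and the exact variable spellings (e.g. DMLC_PS_ROOT_URI vs. DMLC_PS_ROOT_URL) and log-level variables can differ between BytePS versions, so treat this as an illustration rather than a verified recipe:

export DMLC_INTERFACE=eno2          # NIC to use; set it on the workers AND on the PSs/scheduler
export DMLC_PS_ROOT_URI=10.0.0.1    # scheduler IP (placeholder)
export DMLC_PS_ROOT_PORT=1234       # scheduler port (placeholder)
# plus the usual DMLC_ROLE / DMLC_WORKER_ID / DMLC_NUM_WORKER / DMLC_NUM_SERVER variables
export BYTEPS_LOG_LEVEL=INFO        # or DEBUG, for more error information
export PS_VERBOSE=2                 # verbose ps-lite communication logs
bpslaunch python3 ../../basic_byteps.py --model AlexNet --num-iters 100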
Thank you for sharing. Today I rebuilt my CUDA 11.0 Docker image and ran that single-machine training job again, and it is running correctly. Besides, my system is running out of disk space, so I am not going to rebuild a CUDA 11.1 image soon to reproduce that problem. Anyway, thanks for your time!
So you successfully ran BytePS with CUDA 11? This has troubled me for a long time. If convenient, I hope you can briefly introduce your CUDA 11 Docker image! Thanks a lot!
Unfortunately it's been nearly a year and I can't find that Dockerfile. The main change that still lingers in my memory: I modified the versions of libcudnn, libnccl2 and libnccl-dev specified in the original Dockerfile to the ones built for CUDA 11.0, according to the apt repository. Hope it helps, good luck!
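In case it helps others, a hedged sketch of the kind of change described above: inside the Dockerfile's apt-get step, pin the cuDNN/NCCL packages to CUDA 11.0 builds. The version strings below are illustrative assumptions, not the ones from the original Dockerfile; look up the actual CUDA 11.0 builds with apt-cache madison (or on NVIDIA's apt repository) before using them:

# Illustrative versions only -- verify with `apt-cache madison libcudnn8 libnccl2 libnccl-dev`
apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages \
    libcudnn8=8.0.5.39-1+cuda11.0 \
    libnccl2=2.8.3-1+cuda11.0 \
    libnccl-dev=2.8.3-1+cuda11.0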
Thank you for helping, I'll try this. My machine is equipped with a 3080 Ti, which only supports CUDA 11 and later. Actually, I want to use ByteScheduler instead of BytePS.
I have 2 machines, each equipped with a GPU. I tried to run single-machine PyTorch training on both machines separately.
However, the program runs and trains on machine 1 but gets stuck on machine 2.
The screenshot of machine 2 is as follows:
The script I run is:
When I press Ctrl+C on machine 2, the screenshot is:
P.S. The output on machine 2 appears only after I shut down the training job FOR A WHILE.
I checked the conda list on both machines; the versions of torch and byteps are the same on both.
Besides, the versions of CUDA, cuDNN and NCCL are the same, as follows: