Closed aragakiyui611 closed 3 years ago
What's your GPU information? You can get it by running nvidia-smi. And what command do you use to start the program?
I ran sh train_ycb.sh, which contains the code below:
#!/bin/bash
n_gpu=4  # number of gpu to use
python -m torch.distributed.launch --nproc_per_node=$n_gpu train_ycb.py --gpus=$n_gpu
The full log is:
(gorilla) xxx@ubuntu:~/FFB6D/ffb6d$ bash train_ycb.sh
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
local_rank: 1
local_rank: 3
local_rank: 0
local_rank: 2
train_dataset_size: 96189
test_dataset_size: 2949
loading resnet34 pretrained mdl.
local_rank: 2
train_dataset_size: 96189
test_dataset_size: 2949
train_dataset_size: 96189
train_dataset_size: 96189
test_dataset_size: 2949
test_dataset_size: 2949
loading resnet34 pretrained mdl.
loading resnet34 pretrained mdl.
loading resnet34 pretrained mdl.
local_rank: 0
local_rank: 1
local_rank: 3
Selected optimization level O0: Pure FP32 training.
Defaults for this optimization level are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : 1.0
/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
  warnings.warn(warning.format(ret))
Totally train 200393 iters per gpu.
epochs: 0%| | 0/25 [00:00<?, ?it/s
/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
  warnings.warn(warning.format(ret))
Totally train 200393 iters per gpu.
epochs: 0%| | 0/25 [00:00<?, ?it/s
/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
  warnings.warn(warning.format(ret))
Totally train 200393 iters per gpu.
/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
  warnings.warn(warning.format(ret))
epochs: 0%| | 0/25 [00:00<?, ?it/s]
Totally train 200393 iters per gpu.
epochs: 0%| | 0/25 [00:05<?, ?it/s]
train: 0%| | 0/5566 [00:00<?, ?it/s
Traceback (most recent call last):
  File "train_ycb.py", line 672, in <module>
    train()
  File "train_ycb.py", line 663, in train
    clr_div=clr_div
  File "train_ycb.py", line 468, in train
    _, loss, res = self.model_fn(self.model, batch, it=it)
  File "train_ycb.py", line 229, in model_fn
    end_points = model(cu_dt)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
epochs: 0%| | 0/25 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "train_ycb.py", line 672, in <module>
    train()
  File "train_ycb.py", line 663, in train
    clr_div=clr_div
  File "train_ycb.py", line 468, in train
    _, loss, res = self.model_fn(self.model, batch, it=it)
  File "train_ycb.py", line 229, in model_fn
    end_points = model(cu_dt)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3
epochs: 0%| | 0/25 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "train_ycb.py", line 672, in <module>
    train()
  File "train_ycb.py", line 663, in train
    clr_div=clr_div
  File "train_ycb.py", line 468, in train
    _, loss, res = self.model_fn(self.model, batch, it=it)
  File "train_ycb.py", line 229, in model_fn
    end_points = model(cu_dt)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/gorilla/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
and the output of nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   42C    P2    71W / 250W |    963MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   22C    P8     8W / 250W |    721MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 23%   26C    P8     7W / 250W |    715MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   26C    P8     8W / 250W |     12MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     30325      C   xxx/anaconda3/envs/gorilla/bin/python        953MiB |
|    1     30325      C   xxx/anaconda3/envs/gorilla/bin/python        711MiB |
|    2     30325      C   xxx/anaconda3/envs/gorilla/bin/python        705MiB |
+-----------------------------------------------------------------------------+
I modified these lines to fit my configuration: there are 8 GPUs in the machine and I use GPUs No. 0 to 3.
Thank you, it works after removing this line. Besides, is there any way to solve this multi-output?
The bug caused by nn.DataParallel is fixed. Use git pull to update your code so that it won't affect the evaluation.
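For anyone hitting the same RuntimeError: the traceback shows nn.DataParallel's forward being called from inside DistributedDataParallel. With torch.distributed.launch each process owns one GPU, so on ranks 1 to 3 the parameters sit on cuda:1 to cuda:3, while nn.DataParallel by default expects them on cuda:0. Below is a minimal sketch of the usual pattern, not the actual patch in the repository. The helper names device_for_rank and wrap_for_distributed are made up for illustration, and the sketch assumes the process group has already been initialized.

```python
def device_for_rank(local_rank: int) -> str:
    """Name of the single GPU owned by the process with this local rank."""
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return f"cuda:{local_rank}"


def wrap_for_distributed(model, local_rank: int):
    """Move the model to this rank's GPU and wrap it in DistributedDataParallel only.

    Hypothetical helper. Assumes torch.distributed.init_process_group()
    has already been called in this process.
    """
    # Imported lazily so device_for_rank() can be used without PyTorch installed.
    import torch
    from torch.nn.parallel import DistributedDataParallel

    device = torch.device(device_for_rank(local_rank))
    torch.cuda.set_device(device)  # make this GPU the default for the process
    model = model.to(device)
    # Deliberately no nn.DataParallel wrapper here: each process drives exactly
    # one GPU, and DataParallel would look for the parameters on cuda:0.
    return DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
```

In a training script this would be called once per process, for example wrap_for_distributed(model, args.local_rank), after the process group is set up.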
Thank you, it works after removing this line. Besides, are there any ways to solve this multi-output?
Yes, you can modify the code and use tqdm.tqdm only when args.local_rank == 0. But the multi-output won't affect the training, so I put it in low priority.
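A rough sketch of that suggestion, in case it helps. The function names parse_local_rank and rank_zero_progress are hypothetical and not part of the repository. It relies on torch.distributed.launch passing --local_rank to each process, and it falls back to the plain iterable when tqdm is not installed.

```python
import argparse

try:
    from tqdm import tqdm
except ImportError:  # keep the sketch usable without tqdm installed
    tqdm = None


def parse_local_rank(argv=None):
    """Read the --local_rank flag that torch.distributed.launch passes in."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores other flags such as --gpus.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def rank_zero_progress(iterable, local_rank, **tqdm_kwargs):
    """Wrap iterable in tqdm on rank 0 only; other ranks get it back untouched."""
    if local_rank != 0 or tqdm is None:
        return iterable
    return tqdm(iterable, **tqdm_kwargs)
```

In the training loop, something like `for epoch in rank_zero_progress(range(n_epochs), local_rank, desc="epochs"):` would then print a single bar instead of one per process. Passing `disable=(local_rank != 0)` to tqdm.tqdm directly should have a similar effect.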
How could I solve this bug? Thank you!