facebookresearch / CutLER

Code release for "Cut and Learn for Unsupervised Object Detection and Instance Segmentation" and "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation"

How can I self-train with 1 gpu? #38

Closed. Howie-Ye closed this issue 11 months ago.

Howie-Ye commented 1 year ago

Thank you for the cool work! I have a question: how can I use just 1 GPU to train on my own dataset? There is a --num-gpus option in the argument set, I only have 1 GPU, and my dataset is pretty small.

However, there is a bug when I launch the script:

python train_net.py --num-gpus 1 --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml --train-dataset imagenet_train_r1 OUTPUT_DIR ../model_output/self-train-r1/

It doesn't work; here is the part of the log with the bug.

[07/21 15:40:26 d2.engine.train_loop]: Starting training from iteration 0
/root/autodl-tmp/project/CutLER/cutler/data/detection_utils.py:437: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:143.)
  torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])
/root/autodl-tmp/project/CutLER/cutler/data/detection_utils.py:437: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:143.)
  torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])
ERROR [07/21 15:40:27 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
    self._trainer.run_step()
  File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
    loss_dict = self.model(data)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
    features = self.backbone(images.tensor)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
    bottom_up_features = self.bottom_up(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
    x = self.norm(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
    return _get_group_size(group)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
    default_pg = _get_default_group()
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[07/21 15:40:27 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[07/21 15:40:27 d2.utils.events]: iter: 0 lr: N/A max_mem: 1098M
Traceback (most recent call last):
  File "train_net.py", line 170, in <module>
    launch(
  File "/root/autodl-tmp/project/detectron2/detectron2/engine/launch.py", line 84, in launch
    main_func(*args)
  File "train_net.py", line 160, in main
    return trainer.train()
  File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 495, in train
    super().train(self.start_iter, self.max_iter)
  File "/root/autodl-tmp/project/detectron2/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/root/autodl-tmp/project/CutLER/cutler/engine/defaults.py", line 505, in run_step
    self._trainer.run_step()
  File "/root/autodl-tmp/project/CutLER/cutler/engine/train_loop.py", line 335, in run_step
    loss_dict = self.model(data)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/CutLER/cutler/modeling/meta_arch/rcnn.py", line 160, in forward
    features = self.backbone(images.tensor)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/fpn.py", line 139, in forward
    bottom_up_features = self.bottom_up(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/project/detectron2/detectron2/layers/wrappers.py", line 131, in forward
    x = self.norm(x)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
    return _get_group_size(group)
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
    default_pg = _get_default_group()
  File "/root/miniconda3/envs/cutler/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

It looks like it is related to DDP. Thanks for your help!
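
For context, the error means the single-process run never calls torch.distributed.init_process_group, which the SyncBatchNorm layers expect. A minimal sketch of one possible workaround, not the repo's recommended setup: create a one-process group before training, so SyncBatchNorm falls back to ordinary per-GPU batch statistics. The address and port values below are placeholders.

```python
# Hypothetical workaround sketch: initialize a 1-process group before building
# and running the trainer, so torch.nn.SyncBatchNorm's call to
# torch.distributed.get_world_size() succeeds. With world_size == 1 the layer
# behaves like plain BatchNorm (no cross-GPU sync happens).
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single local process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

if not dist.is_initialized():
    # "nccl" assumes a CUDA GPU is available; "gloo" would be the CPU fallback.
    dist.init_process_group(backend="nccl", rank=0, world_size=1)
```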

frank-xwang commented 1 year ago

Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.
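
For future readers: the usual resolution in that Detectron2 discussion is to switch the SyncBN norm layers to plain BN when training on a single GPU. Since the launch command already accepts trailing KEY VALUE config overrides (as with OUTPUT_DIR in the command above), the same change can simply be appended there. Below is a rough Python sketch of the equivalent config edit; it assumes the standard detectron2 norm keys, so verify them against the CutLER YAML before relying on it.

```python
# Rough sketch (assumes standard detectron2 config keys; check the CutLER YAML):
# switch SyncBN to plain BN so a single-process run does not need an
# initialized process group.
from detectron2.config import get_cfg

cfg = get_cfg()
# NOTE: if the CutLER config defines custom keys, it has to be loaded through
# the repo's own setup() in train_net.py rather than the vanilla get_cfg().
cfg.merge_from_file(
    "model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_self_train.yaml"
)
cfg.merge_from_list([
    "MODEL.RESNETS.NORM", "BN",        # backbone norm
    "MODEL.FPN.NORM", "BN",            # FPN lateral/output convs
    "MODEL.ROI_BOX_HEAD.NORM", "BN",   # box head norm
    "MODEL.ROI_MASK_HEAD.NORM", "BN",  # mask head norm
])
```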

Howie-Ye commented 1 year ago

> Hello, considering that our model utilizes SyncBatchNorm, it is important to note that it cannot be used with a single worker on either CPU or GPU. To address this issue, I suggest referring to this link in the Detectron2 repository, where you may find a resolution for this problem.

Thanks for your reply! I've read the code and found these configs. In the end, using at least 2 GPUs resolved the issue.

Howie-Ye commented 1 year ago

BTW, I found that if I use python=3.8, torch=1.8.1, and cudatoolkit=11.3 with an RTX 30-series GPU, the CUDA error "index >= -sizes[i] && index < sizes[i] && index out of bounds" appears, which should already be fixed according to this issue and this issue. Confusing. So I had to switch to an RTX 20-series GPU with cudatoolkit=10.2, though the batch size then has to be set smaller in the config file. I don't know why I'm the only one hitting these problems. T_T
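
One thing that may be worth checking (an assumption about the likely cause, not a guaranteed fix): device-side "index out of bounds" asserts sometimes come from a PyTorch wheel that was not compiled for the GPU's architecture. RTX 30-series cards are compute capability 8.6 (sm_86) and need a CUDA 11.x build of PyTorch; a build targeting cudatoolkit 10.2 cannot run on them. A small diagnostic sketch:

```python
# Diagnostic sketch: confirm which CUDA version the installed PyTorch was
# actually built with and whether it includes sm_86 kernels for RTX 30-series.
import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))  # (8, 6) for RTX 30xx
    print("compiled arch list:", torch.cuda.get_arch_list())           # should include 'sm_86' for RTX 30xx
```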

frank-xwang commented 1 year ago

This bug seems to be a hardware-related issue; sorry for not being able to help much. Let me know if there are any code-related bugs.