Closed · wanghangege closed this issue 3 years ago
Hi, could you post your training log file? Details will be much more helpful for debugging.
Oh, to repeat: first of all, thank you for your reply! When I run the example command "python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml", I get an out-of-memory error (the same problem as before). So I edited "Base-YOLOF.yaml" and set "IMS_PER_BATCH: 32"; with a batch size of 32 it runs. My environment: GPU: 2 × TITAN V (11 GB each), pytorch=1.8.1, python=3.6, cuda=10.1, cudnn=7.6.3. Is the error caused by insufficient GPU memory? Also, some config files don't set the batch size; where should I define it? Looking forward to your reply, thank you again!
The settings are designed for 8 GPUs; you should adjust them when you use 2 GPUs, according to the guidelines of Detectron2.
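For concreteness, here is one way to apply that guideline. This is a minimal sketch that assumes the 8-GPU defaults in Base-YOLOF.yaml are IMS_PER_BATCH: 64 and BASE_LR: 0.12 (check your local copy; the values may differ), and that YOLOF's train_net.py uses Detectron2's default argument parser, which accepts trailing "KEY VALUE" config overrides so no file editing is needed:

```
# Linear scaling rule: going from 8 GPUs to 2 GPUs, divide the total
# batch size by 4 and scale the base learning rate by the same factor.
# Assumed defaults: IMS_PER_BATCH 64, BASE_LR 0.12 -> 16 and 0.03.
python ./tools/train_net.py --num-gpus 2 \
    --config-file ./configs/yolof_R_50_C5_1x.yaml \
    SOLVER.IMS_PER_BATCH 16 \
    SOLVER.BASE_LR 0.03
```

If you shrink the batch size, you may also want to lengthen SOLVER.MAX_ITER and SOLVER.STEPS proportionally so the model still sees the same total number of images.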
Thank you, I solved it!
```
[04/14 16:15:25 d2.engine.hooks]: Total training time: 0:00:24 (0:00:00 on hooks)
[04/14 16:15:25 d2.utils.events]: iter: 0  lr: N/A  max_mem: 7597M
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 79, in launch
    daemon=False,
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
    main_func(*args)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/tools/train_net.py", line 221, in main
    return trainer.train()
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 480, in train
    super().train(self.start_iter, self.max_iter)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 490, in run_step
    self._trainer.run_step()
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/yolof/modeling/yolof.py", line 273, in forward
    features = self.backbone(images.tensor)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 449, in forward
    x = stage(x)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 201, in forward
    out = self.conv3(out)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/wrappers.py", line 88, in forward
    x = self.norm(x)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/batch_norm.py", line 65, in forward
    eps=self.eps,
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/functional.py", line 2150, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 2.04 GiB (GPU 1; 11.78 GiB total capacity; 5.89 GiB already allocated; 751.50 MiB free; 9.00 GiB reserved in total by PyTorch)
```
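For reference, this is a standard PyTorch CUDA out-of-memory failure on GPU 1; the usual remedy is to shrink the per-step batch further and rescale the learning rate to match. A hedged sketch, again assuming YOLOF's train_net.py accepts Detectron2's trailing "KEY VALUE" config overrides:

```
# Halve the batch again if 16 images still overflow an 11 GB TITAN V;
# keep BASE_LR proportional to IMS_PER_BATCH (linear scaling rule).
python ./tools/train_net.py --num-gpus 2 \
    --config-file ./configs/yolof_R_50_C5_1x.yaml \
    SOLVER.IMS_PER_BATCH 8 \
    SOLVER.BASE_LR 0.015
```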