chensnathan / YOLOF

You Only Look One-level Feature (YOLOF), CVPR2021, Detectron2
MIT License

OOM? What does this error mean? #14

Closed wanghangege closed 3 years ago

wanghangege commented 3 years ago

[04/14 16:15:25 d2.engine.hooks]: Total training time: 0:00:24 (0:00:00 on hooks)
[04/14 16:15:25 d2.utils.events]: iter: 0  lr: N/A  max_mem: 7597M
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 79, in launch
    daemon=False,
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
    main_func(*args)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/tools/train_net.py", line 221, in main
    return trainer.train()
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 480, in train
    super().train(self.start_iter, self.max_iter)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 490, in run_step
    self._trainer.run_step()
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/yolof/modeling/yolof.py", line 273, in forward
    features = self.backbone(images.tensor)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 449, in forward
    x = stage(x)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 201, in forward
    out = self.conv3(out)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/wrappers.py", line 88, in forward
    x = self.norm(x)
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/batch_norm.py", line 65, in forward
    eps=self.eps,
  File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/functional.py", line 2150, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 2.04 GiB (GPU 1; 11.78 GiB total capacity; 5.89 GiB already allocated; 751.50 MiB free; 9.00 GiB reserved in total by PyTorch)

chensnathan commented 3 years ago

Hi, Could you post your training log file? Details will be much more helpful for debugging.

wanghangege commented 3 years ago

> Hi, Could you post your training log file? Details will be much more helpful for debugging.

(Sorry, the quote above just repeats your comment.) First of all, thank you for your reply! When I run the example command "python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml", the out-of-memory error shown above appears. So I switched to the "Base-YOLOF.yaml" config and changed its "IMS_PER_BATCH" parameter; with a batch size of 32 it runs. My environment: 2 x TITAN V GPUs (11 GB each), pytorch=1.8.1, python=3.6, cuda=10.1, cudnn=7.6.3. Is the error simply caused by insufficient GPU memory on my side? Also, some config files do not contain a batch-size setting; where should I define it? Looking forward to your reply, and thank you again!
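
For readers with the same question about where the batch size lives: in Detectron2-style repos, each per-model config usually inherits from a shared base file through the _BASE_ key, so a config that does not list IMS_PER_BATCH simply picks it up from the base, and overriding it in either file works. A minimal sketch; the file names follow this repo, but treat the excerpt as an assumption rather than the actual file contents:

    # configs/yolof_R_50_C5_1x.yaml (illustrative excerpt, not the real file)
    _BASE_: "Base-YOLOF.yaml"      # anything not set here is inherited from the base config
    SOLVER:
      IMS_PER_BATCH: 32            # overrides the inherited value; 32 is the figure used above

The same override also works from the command line without editing any YAML, by appending key-value pairs after the other arguments, e.g. "SOLVER.IMS_PER_BATCH 32".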

chensnathan commented 3 years ago

The settings are designed for 8 GPUs; when training with 2 GPUs, you should adjust them according to the Detectron2 guidelines.
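
For readers hitting the same OOM: the Detectron2 guideline referred to here is the linear scaling rule. When the total batch size is reduced by some factor, reduce the base learning rate by the same factor and stretch the schedule by its inverse. A hedged sketch with purely illustrative numbers (the actual YOLOF defaults may differ):

    # Hypothetical 8-GPU defaults for comparison:
    #   IMS_PER_BATCH: 64, BASE_LR: 0.12, MAX_ITER: 22500, STEPS: (15000, 20000)
    # Scaled for a setup that can only fit a quarter of that batch:
    SOLVER:
      IMS_PER_BATCH: 16            # 64 / 4
      BASE_LR: 0.03                # 0.12 / 4, following the linear scaling rule
      MAX_ITER: 90000              # 22500 * 4, so the same total number of images is seen
      STEPS: (60000, 80000)        # learning-rate decay points stretched by the same factor

With 2 GPUs this corresponds to 8 images per GPU; if that still overflows an 11 GB card, the same rule can be applied again with a smaller batch.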

wanghangege commented 3 years ago

> The settings are designed for 8 GPUs; when training with 2 GPUs, you should adjust them according to the Detectron2 guidelines.

Thank you, I solved it!