Out of memory error during evaluation but training works fine!

Hello, I am trying to fine-tune the model on a custom dataset in VOC format. Training completes without problems, but during evaluation CUDA raises an out-of-memory error. Could you please advise? I am using the same pascal_voc_evaluation.py with minor modifications, such as commenting out the line self._base_classes = meta.base_classes, because I am using the pre-trained VOC weights. I would really appreciate any help.

Note: I have not resized my custom dataset because, as I understand it, DeFRCN resizes the images itself during training and inference. Please correct me if I am wrong.
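For reference, the log in this issue reports ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333) for inference. The following is a minimal sketch of the size arithmetic that such a transform is generally understood to perform, written in plain Python so it runs without detectron2. The function name resize_shortest_edge and the example image sizes are hypothetical; the issue does not state the actual image dimensions. Whether the PCB prototype-building step applies the same resize is not something this log shows.

```python
def resize_shortest_edge(height, width, short_edge=800, max_size=1333):
    """Return (new_height, new_width) after a shortest-edge resize.

    Sketch of the usual rule: scale so the shorter side equals `short_edge`,
    then shrink further if the longer side would exceed `max_size`.
    """
    scale = short_edge / min(height, width)
    if height < width:
        new_h, new_w = short_edge, scale * width
    else:
        new_h, new_w = scale * height, short_edge

    # Cap the longer side at max_size, keeping the aspect ratio.
    if max(new_h, new_w) > max_size:
        shrink = max_size / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink

    # Round to the nearest integer pixel count.
    return int(new_h + 0.5), int(new_w + 0.5)


# Hypothetical input sizes, for illustration only.
print(resize_shortest_edge(3000, 4000))  # (800, 1067)
print(resize_shortest_edge(1080, 1920))  # (750, 1333)
```

Under this rule a resized image never exceeds roughly 800 x 1333 pixels, so any code path that feeds the original, unresized image to a network would see far more pixels than the detector does.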

save changed ckpt to /home/jovyan/thesis_s2577712/DeFRCN/checkpoints/checkpoints/voc/dryrundefrcn/defrcn_det_r101_base1/model_reset_remove.pth
testing!
[02/04 21:06:56 detectron2]: Rank of current process: 0. World size: 1
[02/04 21:06:56 detectron2]: Command line arguments: Namespace(config_file='configs/voc/defrcn_fsod_r101_novel1_1shot_seed0.yaml', dist_url='tcp://127.0.0.1:50152', end_iter=-1, eval_all=False, eval_during_train=False, eval_iter=-1, eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=['MODEL.WEIGHTS', '/home/jovyan/thesis_s2577712/DeFRCN/checkpoints/checkpoints/voc/dryrundefrcn/defrcn_det_r101_base1/model_reset_remove.pth', 'OUTPUT_DIR', '/home/jovyan/thesis_s2577712/DeFRCN/checkpoints/checkpoints/voc/dryrundefrcn/defrcn_fsod_r101_novel1/fsrw-like/1shot_seed0_repeat0', 'TEST.PCB_MODELPATH', '/home/jovyan/thesis_s2577712/DeFRCN/ImageNetPretrainedextracted/ImageNetPretrained/torchvision/resnet101-5d3b4d8f.pth'], resume=False, start_iter=-1)
[02/04 21:06:56 detectron2]: Contents of args.config_file=configs/voc/defrcn_fsod_r101_novel1_1shot_seed0.yaml:
_BASE_: "../Base-RCNN.yaml"
MODEL:
  WEIGHTS: "/Path/to/Base/Pretrain/Weight"
  MASK_ON: False
  BACKBONE:
    FREEZE: False
  RESNETS:
    DEPTH: 101
  RPN:
    ENABLE_DECOUPLE: True
    BACKWARD_SCALE: 0.0
    FREEZE: False
  ROI_HEADS:
    ENABLE_DECOUPLE: True
    BACKWARD_SCALE: 0.001
    NUM_CLASSES: 1
    FREEZE_FEAT: True
    CLS_DROPOUT: True
INPUT:
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  MIN_SIZE_TEST: 800
DATASETS:
  TRAIN: ("jk_Video_trainval_video1_1shot_seed0", )
  TEST: ('JKimage',)
SOLVER:
  IMS_PER_BATCH: 2
  BASE_LR: 0.00125
  STEPS: (640, )
  MAX_ITER: 200
  CHECKPOINT_PERIOD: 100000
  WARMUP_ITERS: 0
TEST:
  PCB_ENABLE: True
  PCB_MODELPATH: "/Path/to/ImageNet/Pre-Train/Weight"
OUTPUT_DIR: "/Path/to/Output/Dir"
[02/04 21:06:56 detectron2]: Full config saved to /home/jovyan/thesis_s2577712/DeFRCN/checkpoints/checkpoints/voc/dryrundefrcn/defrcn_fsod_r101_novel1/fsrw-like/1shot_seed0_repeat0/config.yaml
[02/04 21:06:56 d2.utils.env]: Using a generated random seed 56161326
WARNING [02/04 21:06:56 d2.modeling.backbone.resnet]: ResNet.make_stage(first_stride=) is deprecated! Use 'stride_per_block' or 'stride' instead.
froze roi_box_head parameters
[02/04 21:06:59 defrcn.dataloader.build]: Removed 0 images with no usable annotations. 1 images left.
[02/04 21:06:59 defrcn.dataloader.build]: Distribution of instances among all 1 categories:

category : Japanese Knotweed instances: 1

[02/04 21:06:59 defrcn.dataloader.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[02/04 21:06:59 defrcn.dataloader.build]: Using training sampler TrainingSampler
[02/04 21:06:59 d2.data.common]: Serializing 1 elements to byte tensors and concatenating them all ...
[02/04 21:06:59 d2.data.common]: Serialized dataset takes 0.00 MiB
2023-02-04 21:06:59.524130: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-04 21:06:59.659302: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-02-04 21:07:00.199241: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/jovyan/.local/lib/python3.8/site-packages/cv2/../../lib64:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-04 21:07:00.199310: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/jovyan/.local/lib/python3.8/site-packages/cv2/../../lib64:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-04 21:07:00.199316: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[02/04 21:07:01 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/jovyan/thesis_s2577712/DeFRCN/checkpoints/checkpoints/voc/dryrundefrcn/defrcn_det_r101_base1/model_reset_remove.pth ...
[02/04 21:07:01 d2.engine.train_loop]: Starting training from iteration 1
/home/jovyan/thesis_s2577712/DeFRCN/defrcn/modeling/roi_heads/fast_rcnn.py:198: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  num_fg = fg_inds.nonzero().numel()
[02/04 21:07:14 d2.utils.events]: eta: 0:02:05 iter: 19 total_loss: 0.2943 loss_cls: 0.159 loss_box_reg: 0.03856 loss_rpn_cls: 0.04587 loss_rpn_loc: 0.007675 time: 0.6996 data_time: 0.0321 lr: 0.00125 max_mem: 2777M
[02/04 21:07:28 d2.utils.events]: eta: 0:01:49 iter: 39 total_loss: 0.307 loss_cls: 0.1948 loss_box_reg: 0.09194 loss_rpn_cls: 0.01597 loss_rpn_loc: 0.002623 time: 0.6864 data_time: 0.0069 lr: 0.00125 max_mem: 2777M
[02/04 21:07:41 d2.utils.events]: eta: 0:01:34 iter: 59 total_loss: 0.1985 loss_cls: 0.09809 loss_box_reg: 0.07793 loss_rpn_cls: 0.01192 loss_rpn_loc: 0.002571 time: 0.6770 data_time: 0.0064 lr: 0.00125 max_mem: 2777M
[02/04 21:07:55 d2.utils.events]: eta: 0:01:20 iter: 79 total_loss: 0.1516 loss_cls: 0.07986 loss_box_reg: 0.06549 loss_rpn_cls: 0.007284 loss_rpn_loc: 0.002403 time: 0.6757 data_time: 0.0075 lr: 0.00125 max_mem: 2777M
[02/04 21:08:09 d2.utils.events]: eta: 0:01:07 iter: 99 total_loss: 0.1639 loss_cls: 0.08123 loss_box_reg: 0.0759 loss_rpn_cls: 0.005892 loss_rpn_loc: 0.002034 time: 0.6844 data_time: 0.0067 lr: 0.00125 max_mem: 2777M
[02/04 21:08:23 d2.utils.events]: eta: 0:00:54 iter: 119 total_loss: 0.1657 loss_cls: 0.07605 loss_box_reg: 0.07866 loss_rpn_cls: 0.006489 loss_rpn_loc: 0.002039 time: 0.6857 data_time: 0.0066 lr: 0.00125 max_mem: 2777M
[02/04 21:08:38 d2.utils.events]: eta: 0:00:41 iter: 139 total_loss: 0.1578 loss_cls: 0.07315 loss_box_reg: 0.07192 loss_rpn_cls: 0.004242 loss_rpn_loc: 0.001812 time: 0.6935 data_time: 0.0060 lr: 0.00125 max_mem: 2777M
[02/04 21:08:52 d2.utils.events]: eta: 0:00:27 iter: 159 total_loss: 0.1471 loss_cls: 0.06966 loss_box_reg: 0.07129 loss_rpn_cls: 0.003806 loss_rpn_loc: 0.001632 time: 0.7000 data_time: 0.0060 lr: 0.00125 max_mem: 2777M
[02/04 21:09:07 d2.utils.events]: eta: 0:00:14 iter: 179 total_loss: 0.1463 loss_cls: 0.069 loss_box_reg: 0.06876 loss_rpn_cls: 0.003436 loss_rpn_loc: 0.00178 time: 0.7027 data_time: 0.0066 lr: 0.00125 max_mem: 2777M
[02/04 21:09:22 fvcore.common.checkpoint]: Saving checkpoint to /home/jovyan/thesis_s2577712/DeFRCN/checkpoints/checkpoints/voc/dryrundefrcn/defrcn_fsod_r101_novel1/fsrw-like/1shot_seed0_repeat0/model_final.pth
[02/04 21:09:23 d2.utils.events]: eta: 0:00:00 iter: 199 total_loss: 0.1495 loss_cls: 0.07159 loss_box_reg: 0.07191 loss_rpn_cls: 0.002919 loss_rpn_loc: 0.001454 time: 0.7059 data_time: 0.0075 lr: 0.00125 max_mem: 2777M
[02/04 21:09:23 d2.engine.hooks]: Overall training speed: 197 iterations in 0:02:19 (0.7059 s / it)
[02/04 21:09:23 d2.engine.hooks]: Total training time: 0:02:20 (0:00:01 on hooks)
[02/04 21:09:23 defrcn.dataloader.build]: Distribution of instances among all 1 categories:

category : Japanese Knotweed instances: 67

[02/04 21:09:23 defrcn.dataloader.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')]
[02/04 21:09:23 d2.data.common]: Serializing 24 elements to byte tensors and concatenating them all ...
[02/04 21:09:23 d2.data.common]: Serialized dataset takes 0.01 MiB
[02/04 21:09:23 defrcn.evaluation.evaluator]: Start initializing PCB module, please wait a seconds...
[02/04 21:09:23 defrcn.evaluation.calibration_layer]: Loading ImageNet Pre-train Model from /home/jovyan/thesis_s2577712/DeFRCN/ImageNetPretrainedextracted/ImageNetPretrained/torchvision/resnet101-5d3b4d8f.pth
[02/04 21:09:24 defrcn.dataloader.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')]
[02/04 21:09:24 d2.data.common]: Serializing 1 elements to byte tensors and concatenating them all ...
[02/04 21:09:24 d2.data.common]: Serialized dataset takes 0.00 MiB
Traceback (most recent call last):
  File "main.py", line 72, in <module>
    launch(
  File "/home/jovyan/.local/lib/python3.8/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "main.py", line 67, in main
    return trainer.train()
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/engine/defaults.py", line 385, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/jovyan/.local/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 144, in train
    self.after_train()
  File "/home/jovyan/.local/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 153, in after_train
    h.after_train()
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/engine/hooks.py", line 80, in after_train
    self._do_eval()
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/engine/hooks.py", line 39, in _do_eval
    results = self._func()
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/engine/defaults.py", line 335, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.model)
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/engine/defaults.py", line 498, in test
    results_i = inference_on_dataset(model, data_loader, evaluator, cfg)
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/evaluation/evaluator.py", line 90, in inference_on_dataset
    pcb = PrototypicalCalibrationBlock(cfg)
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/evaluation/calibration_layer.py", line 28, in __init__
    self.prototypes = self.build_prototypes()
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/evaluation/calibration_layer.py", line 58, in build_prototypes
    features = self.extract_roi_features(img, boxes)
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/evaluation/calibration_layer.py", line 98, in extract_roi_features
    conv_feature = self.imagenet_model(images.tensor[:, [2, 1, 0]])[1] # size: BxCxHxW
  File "/home/jovyan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/evaluation/archs/resnet.py", line 203, in forward
    x = self.layer3(x)
  File "/home/jovyan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jovyan/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/jovyan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jovyan/thesis_s2577712/DeFRCN/defrcn/evaluation/archs/resnet.py", line 98, in forward
    out = self.conv1(x)
  File "/home/jovyan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jovyan/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "/home/jovyan/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 415, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 14.56 GiB total capacity; 13.28 GiB already allocated; 17.50 MiB free; 13.57 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "tools/extract_results.py", line 59, in <module>
    main()
  File "tools/extract_results.py", line 34, in main
    results.append([fid] + [float(x) for x in res_info.split(':')[-1].split(',')])
  File "tools/extract_results.py", line 34, in <listcomp>
    results.append([fid] + [float(x) for x in res_info.split(':')[-1].split(',')])
ValueError: could not convert string to float: ' Serialized dataset takes 0.00 MiB'

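To put the numbers in the error message next to the training figures, here is a small sketch that pulls the memory values out of the RuntimeError text and compares the allocated amount with the max_mem of 2777M reported during training. The helper name parse_cuda_oom is hypothetical, the regular expressions assume the message wording shown in this log (other PyTorch versions may phrase it differently), and treating the "M" in max_mem as MiB is also an assumption.

```python
import re

# Conversion factors to MiB for the units that appear in the message.
UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_cuda_oom(message):
    """Extract memory figures (in MiB) from a CUDA out-of-memory message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            figures[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return figures


# Message copied from the traceback in this issue.
oom_message = (
    "CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 14.56 GiB total "
    "capacity; 13.28 GiB already allocated; 17.50 MiB free; 13.57 GiB reserved "
    "in total by PyTorch)"
)
figures = parse_cuda_oom(oom_message)

# max_mem from the training log lines; assumed here to be in MiB.
training_peak_mib = 2777.0
ratio = figures["allocated"] / training_peak_mib

print(figures)
print(f"allocated at failure is about {ratio:.1f}x the training peak")
```

On these figures the process holds almost five times the training peak when the allocation fails, and the failure occurs inside PrototypicalCalibrationBlock.build_prototypes according to the traceback, so that step is where the extra memory is being requested.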
er-muyue / DeFRCN
