SJTU-LuHe / TransVOD

This repository contains the code for the paper "End-to-End Video Object Detection with Spatial-Temporal Transformers".
Apache License 2.0

CUDA out of memory for coco_evaluator #19

Open Zagreus98 opened 1 year ago

Zagreus98 commented 1 year ago

First of all, thank you for your work. I wanted to ask if you know how I can solve this problem. When I try to evaluate your provided multi-frame model with 14 reference frames using r50_eval_multi.sh, the evaluation crashes with a CUDA out of memory error. I should mention that training with r50_train_multi worked just fine, and evaluating the single-frame model on a single GPU also works fine.

My setup is 4 x TITAN Xp GPUs with 12196 MiB each, which in my opinion should be enough for validation. What is strange is that during the evaluation each GPU sits at around 4000 MiB of memory usage, so that shouldn't be the problem...

The logs:

```
Test: Total time: 6:59:35 (0.5718 s / it)
Averaged stats: class_error: 37.50  loss: 1.6011 (1.0219)  loss_bbox: 0.3632 (0.2952)  loss_ce: 0.9131 (0.3930)  loss_giou: 0.2270 (0.3336)  cardinality_error_unscaled: 298.5000 (295.6008)  class_error_unscaled: 50.0000 (14.1343)  loss_bbox_unscaled: 0.0726 (0.0590)  loss_ce_unscaled: 0.4565 (0.1965)  loss_giou_unscaled: 0.1135 (0.1668)
Traceback (most recent call last):
  File "main.py", line 355, in <module>
    main(args)
  File "main.py", line 288, in main
    data_loader_val, base_ds, device, args.output_dir)
  File "/home/alandrei/miniforge-pypy3/envs/py369/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/data3/alandrei/Temporal_OD/TransVOD/engine_multi.py", line 141, in evaluate
    coco_evaluator.synchronize_between_processes()
  File "/data3/alandrei/Temporal_OD/TransVOD/datasets/coco_eval.py", line 66, in synchronize_between_processes
    create_common_coco_eval(self.coco_eval[iou_type], self.img_ids, self.eval_imgs[iou_type])
  File "/data3/alandrei/Temporal_OD/TransVOD/datasets/coco_eval.py", line 201, in create_common_coco_eval
    img_ids, eval_imgs = merge(img_ids, eval_imgs)
  File "/data3/alandrei/Temporal_OD/TransVOD/datasets/coco_eval.py", line 180, in merge
    all_eval_imgs = all_gather(eval_imgs)
  File "/data3/alandrei/Temporal_OD/TransVOD/util/misc.py", line 153, in all_gather
    tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
RuntimeError: CUDA out of memory. Tried to allocate 2.72 GiB (GPU 0; 11.91 GiB total capacity; 11.24 GiB already allocated; 40.62 MiB free; 11.32 GiB reserved in total by PyTorch)
```

The same traceback is raised on the other three GPUs:

```
RuntimeError: CUDA out of memory. Tried to allocate 2.72 GiB (GPU 3; 11.91 GiB total capacity; 11.27 GiB already allocated; 6.62 MiB free; 11.35 GiB reserved in total by PyTorch)
RuntimeError: CUDA out of memory. Tried to allocate 2.72 GiB (GPU 2; 11.91 GiB total capacity; 11.22 GiB already allocated; 54.62 MiB free; 11.30 GiB reserved in total by PyTorch)
RuntimeError: CUDA out of memory. Tried to allocate 2.72 GiB (GPU 1; 11.91 GiB total capacity; 11.23 GiB already allocated; 28.62 MiB free; 11.33 GiB reserved in total by PyTorch)
```

The launcher then exits:

```
Traceback (most recent call last):
  File "./tools/launch.py", line 192, in <module>
    main()
  File "./tools/launch.py", line 188, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['configs/r50_eval_multi.sh']' returned non-zero exit status 1.
```
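The crash happens inside `all_gather` in `util/misc.py`, which (like the DETR helper it appears to be based on) pickles each process's `eval_imgs` results into a `uint8` CUDA tensor and gathers one such tensor per rank, each padded to the largest rank's size. So peak extra GPU memory during `synchronize_between_processes` scales with `world_size * serialized_size`, independent of how little memory the forward pass itself used. A minimal sketch of that cost model (simplified illustration, not the actual TransVOD code; `fake_eval_imgs` is a made-up stand-in for the per-image COCO eval records):

```python
import pickle


def serialized_size_bytes(obj):
    """Size of the pickle buffer that a DETR-style all_gather would
    push through a CUDA uint8 tensor for one process."""
    return len(pickle.dumps(obj))


# Toy stand-in for eval_imgs: per-image evaluation records accumulate fast
# over a long validation set, so the pickled buffer can reach gigabytes.
fake_eval_imgs = [{"img_id": i, "scores": list(range(100))} for i in range(1000)]

world_size = 4
per_rank = serialized_size_bytes(fake_eval_imgs)
# Each rank allocates one padded buffer per process, so the gather
# buffers alone need roughly this much extra memory on every GPU:
peak = world_size * per_rank
```

This would explain why 4000 MiB of steady-state usage still ends in an OOM: the 2.72 GiB allocation in the log is one of these gather buffers, requested on top of whatever the evaluator already holds.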

Zagreus98 commented 1 year ago

Update: I evaluated the model on a single GPU, without the multi-GPU script, and it worked...
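That makes sense: with a single process the gather is a no-op, so the big CUDA staging buffers are never allocated. For anyone who still wants multi-GPU evaluation, one possible direction (a sketch under assumptions, not the repo's code) is to gather the picklable eval results with `torch.distributed.all_gather_object` over a gloo process group, so the serialized buffers stay in host memory; note that with the NCCL backend `all_gather_object` would still stage through CUDA tensors:

```python
import torch.distributed as dist


def all_gather_cpu(data):
    """Gather arbitrary picklable data across ranks.

    Sketch only: assumes a gloo process group (CPU tensors), so the
    large serialized buffers never touch GPU memory. Falls back to a
    single-element list when not running distributed.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return [data]
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, data)
    return gathered
```

In a non-distributed run this simply wraps the input, which matches how the original helper behaves when `world_size == 1`.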

Chrazqee commented 1 year ago

My single GPU with 24 GB runs out of memory when I run the 'Compiling CUDA operators' step, and I don't know why. The logs are as follows:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 23.68 GiB total capacity; 21.04 GiB already allocated; 52.25 MiB free; 21.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
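Failing to allocate 2 MiB with ~21 GiB reserved is the fragmentation case the error message itself points at. Following that hint, the allocator option can be set before launching (supported in PyTorch 1.10+; the 128 MB value below is only an illustrative starting point, not a recommendation from this repo):

```shell
# Cap the size of cached blocks the allocator may hold unsplit,
# reducing fragmentation at some throughput cost. Tune the value.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

If that doesn't help, it may be worth checking whether another process is already holding most of the 24 GB before the operators are even compiled.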