braun-steven / DAFNe

Code for our paper "DAFNe: A One-Stage Anchor-Free Deep Model for Oriented Object Detection".
MIT License

RuntimeError: CUDA out of memory #8

Closed keshavoct98 closed 1 year ago

keshavoct98 commented 2 years ago

While training the model on DOTA 1.0, GPU memory keeps increasing every few iterations until it runs out of memory. I have tried training both with and without Docker. Any idea why this is happening?

braun-steven commented 2 years ago

Never had this issue before. Did you change anything in the dependencies or in the code?

keshavoct98 commented 2 years ago

No, I didn't. Below is the complete error:

```
Traceback:
  File "./tools/plain_train_net.py", line 602, in main
    do_train(cfg, model, resume=args.resume)
  File "./tools/plain_train_net.py", line 452, in do_train
    loss_dict = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/one_stage_detector.py", line 48, in forward
    return super().forward(batched_inputs)
  File "/app/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 313, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/dafne/dafne.py", line 124, in forward
    results, losses = self.dafne_outputs.losses(
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 523, in losses
    training_targets = self._get_ground_truth(locations, gt_instances)
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 264, in _get_ground_truth
    training_targets = self.compute_targets_for_locations(
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 393, in compute_targets_for_locations
    reg_targets_abcd_per_im = compute_abcd(corners, xs_ext, ys_ext)
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 75, in compute_abcd
    abcd = dist_point_to_line(left, right, xs_ext[..., None], ys_ext[..., None])
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 61, in dist_point_to_line
    nom = torch.abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
```

```
Error: CUDA out of memory. Tried to allocate 534.00 MiB (GPU 0; 11.92 GiB total capacity; 9.61 GiB already allocated; 447.12 MiB free; 10.41 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "./tools/plain_train_net.py", line 665, in <module>
    launch(
  File "/app/detectron2_repo/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "./tools/plain_train_net.py", line 656, in main
    raise e
  File "./tools/plain_train_net.py", line 602, in main
    do_train(cfg, model, resume=args.resume)
  File "./tools/plain_train_net.py", line 452, in do_train
    loss_dict = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/one_stage_detector.py", line 48, in forward
    return super().forward(batched_inputs)
  File "/app/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 313, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/dafne/dafne.py", line 124, in forward
    results, losses = self.dafne_outputs.losses(
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 523, in losses
    training_targets = self._get_ground_truth(locations, gt_instances)
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 264, in _get_ground_truth
    training_targets = self.compute_targets_for_locations(
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 393, in compute_targets_for_locations
    reg_targets_abcd_per_im = compute_abcd(corners, xs_ext, ys_ext)
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 75, in compute_abcd
    abcd = dist_point_to_line(left, right, xs_ext[..., None], ys_ext[..., None])
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 61, in dist_point_to_line
    nom = torch.abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
RuntimeError: CUDA out of memory. Tried to allocate 534.00 MiB (GPU 0; 11.92 GiB total capacity; 9.61 GiB already allocated; 447.12 MiB free; 10.41 GiB reserved in total by PyTorch)
```

braun-steven commented 2 years ago

Are you sure it runs OOM due to some memory leak or does the model simply not fit into your memory with the batch size you are using?
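One way to tell the two apart is to log the CUDA memory counters every few iterations and watch the trend. A minimal sketch, assuming the training loop in `do_train` of `tools/plain_train_net.py`; the `iteration` variable and the 50-iteration interval are placeholders, not code from the repo:

```python
import torch

# Hypothetical logging inside the training loop in do_train()
# (tools/plain_train_net.py). `iteration` and the 50-iteration interval
# are placeholders chosen for illustration.
if iteration % 50 == 0:
    allocated_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"iter {iteration}: allocated={allocated_mib:.0f} MiB, "
          f"reserved={reserved_mib:.0f} MiB")
```

A curve that is roughly flat (just too high) means the model simply does not fit at that batch size; a curve that keeps climbing points at a leak, e.g. tensors being kept alive across iterations.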

keshavoct98 commented 2 years ago

The memory consumption increases every few iterations. I am using a batch size of 8 on a single GPU with 12 GB of memory.

[screenshot: GPU memory usage over training iterations]

braun-steven commented 2 years ago

Can you try this out with a batch size of 4 or 2 to see if the same issue happens but simply after more iterations?
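Assuming DAFNe exposes detectron2's standard config system (an assumption on my part, not verified here), the batch size can be lowered without touching the code, for example:

```python
# Hypothetical override, assuming detectron2's standard config keys are used;
# SOLVER.IMS_PER_BATCH is the total number of images per iteration across all GPUs.
cfg.SOLVER.IMS_PER_BATCH = 2
```

The same key can usually also be overridden from the command line via the trailing `opts` arguments of the training script.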

keshavoct98 commented 2 years ago

Tested with batch sizes 2 and 4 as well. Still facing the same issue.

braun-steven commented 2 years ago

At which iteration is this now happening? Still around 500 or much later now?

keshavoct98 commented 2 years ago

With batch size 2 it happens after around iteration 2700.

braun-steven commented 2 years ago

Okay, then this really looks like a memory leak. Since I cannot reproduce this problem, I can't really help you here, sorry. Maybe you can try your luck with the PyTorch Profiler: set the maximum number of iterations to just a few hundred, stopping before the OOM happens, and see which parts take up the largest chunks of memory.
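Something along these lines might work as a starting point. This is an untested sketch; `data_loader`, `model`, and `optimizer` stand in for whatever `do_train` in `tools/plain_train_net.py` actually uses, and the iteration cap is arbitrary:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Untested sketch: wrap a limited number of training iterations in the profiler
# and print the ops ranked by CUDA memory usage. `data_loader`, `model`, and
# `optimizer` are placeholders for the objects used in do_train().
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    for iteration, data in enumerate(data_loader):
        loss_dict = model(data)
        losses = sum(loss_dict.values())
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        if iteration >= 300:  # stop well before the OOM point
            break

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=20))
```

Ops with an unexpectedly large `self_cuda_memory_usage`, or allocations that grow with the iteration count, would be the first place to look.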

braun-steven commented 2 years ago

Keep me posted if you happen to find out anything interesting based on the profiler results.

keshavoct98 commented 2 years ago

Cool. Thanks for the guidance.