While training the model on DOTA 1.0, GPU memory keeps increasing every few iterations until it runs out of memory. I have tried training both with and without Docker. Any ideas on why this is happening?
Never had this issue before. Did you change anything in the dependencies or in the code?
No, I didn't. Below is the complete error:
```
Traceback (most recent call last):
  File "./tools/plain_train_net.py", line 665, in <module>
  File "./tools/plain_train_net.py", line 602, in main
    do_train(cfg, model, resume=args.resume)
  File "./tools/plain_train_net.py", line 452, in do_train
    loss_dict = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/one_stage_detector.py", line 48, in forward
    return super().forward(batched_inputs)
  File "/app/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 313, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/dafne/dafne.py", line 124, in forward
    results, losses = self.dafne_outputs.losses(
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 523, in losses
    training_targets = self._get_ground_truth(locations, gt_instances)
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 264, in _get_ground_truth
    training_targets = self.compute_targets_for_locations(
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 393, in compute_targets_for_locations
    reg_targets_abcd_per_im = compute_abcd(corners, xs_ext, ys_ext)
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 75, in compute_abcd
    abcd = dist_point_to_line(left, right, xs_ext[..., None], ys_ext[..., None])
  File "/app/dafne/dafne/modeling/dafne/dafne_outputs.py", line 61, in dist_point_to_line
    nom = torch.abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
RuntimeError: CUDA out of memory. Tried to allocate 534.00 MiB (GPU 0; 11.92 GiB total capacity; 9.61 GiB already allocated; 447.12 MiB free; 10.41 GiB reserved in total by PyTorch)
```
Are you sure it runs OOM due to a memory leak, or does the model simply not fit into memory with the batch size you are using?
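One quick way to tell: log the CUDA allocator stats every few hundred iterations and see whether the allocated memory actually climbs over time. A minimal sketch (the helper name and the hook into do_train() are illustrative, not code from this repo):

```python
import torch

def log_gpu_mem(iteration, every=100):
    """Print CUDA allocator stats every `every` iterations.

    Call this from inside the training loop (e.g. do_train() in
    tools/plain_train_net.py) to see whether allocated memory grows
    steadily or just spikes.
    """
    if iteration % every == 0:
        alloc_mib = torch.cuda.memory_allocated() / 2**20
        reserved_mib = torch.cuda.memory_reserved() / 2**20
        print(f"iter {iteration}: allocated={alloc_mib:.0f} MiB, "
              f"reserved={reserved_mib:.0f} MiB")
```

If the allocated number grows steadily at a fixed batch size, something is holding on to tensors across iterations (a common culprit is accumulating loss values together with their autograd graphs instead of calling `.item()`).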
The memory consumption increases every few iterations. I am using a batch size of 8 on a single GPU with 12 GB of memory.
Can you try this out with a batch size of 4 or 2 to see if the same issue happens but simply after more iterations?
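Assuming DAFNE follows the standard detectron2 config system (which plain_train_net.py suggests), the batch size should be controlled by SOLVER.IMS_PER_BATCH; a sketch:

```python
from detectron2.config import get_cfg

# Generic detectron2 sketch; DAFNE adds its own config keys, so in
# practice the override is easiest on the command line, e.g.:
#   python tools/plain_train_net.py --config-file <your-config> SOLVER.IMS_PER_BATCH 2
cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 2  # total images per iteration across all GPUs
```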
Tested with batch sizes 2 and 4 as well. Still facing the same issue.
At which iteration is this now happening? Still around 500 or much later now?
With batch size 2 it happens after about 2700 iterations.
Okay, then this really looks like a memory leak. Since I cannot reproduce the problem, I can't really help you here, sorry. Maybe you can try your luck with the PyTorch profiler: cap the number of iterations at a few hundred, just before the OOM usually happens, and see which parts take up the largest chunk of memory.
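Something along these lines, assuming a PyTorch recent enough to ship torch.profiler (on older versions, torch.autograd.profiler.profile(profile_memory=True) is the rough equivalent); model, data_loader and optimizer stand in for the actual objects in plain_train_net.py:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_training(model, data_loader, optimizer, max_iter=300):
    """Run a limited number of iterations under the profiler with
    memory tracking, then print the ops that allocate the most CUDA
    memory. Pick max_iter so the run stops shortly before the OOM
    would normally hit."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        for iteration, data in enumerate(data_loader):
            loss_dict = model(data)  # detectron2 models return a dict of losses in training mode
            losses = sum(loss_dict.values())
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
            if iteration >= max_iter:
                break
    # Rank ops by the CUDA memory they allocate themselves.
    print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=20))
```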
Keep me posted if you happen to find out anything interesting based on the profiler results.
Cool. Thanks for the guidance.