ethnhe / FFB6D

[CVPR2021 Oral] FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation.
MIT License

CUDA out of memory related to data parallel #35

Open taeyeopl opened 3 years ago

taeyeopl commented 3 years ago

I have a simple question about data parallelism in your code. When I ran the provided scripts, I got a CUDA out-of-memory error. The main problem is that GPU 0 uses more memory than the other GPUs.

Q1. Have you experienced similar issues, or is this behavior expected? I observed that the number of workers affects the memory usage on GPU 0. Can you explain where the 147MiB is allocated in your code?

I think normalSpeed.depth_normal uses memory, and the amount depends on the number of workers. Is that right? Do you have a way to distribute this allocation across the GPUs?

My GPU is currently a TITAN Xp.

[Screenshot from 2021-09-05 12-50-37]

taeyeopl commented 3 years ago

I solved the problem with a workaround. If you can share any comments based on your experience, it would be really helpful.

I changed CUDA_VISIBLE_DEVICES and a few related lines:

os.environ['CUDA_VISIBLE_DEVICES'] = str(args.local_rank)
device = torch.device('cuda:{}'.format(0))
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[0], output_device=0,

[Screenshot from 2021-09-05 19-43-50]
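
For readers trying the same workaround, here is a minimal sketch of the idea behind it, not the FFB6D code itself. Setting CUDA_VISIBLE_DEVICES to the process's local rank hides every other GPU from that process, so the one visible GPU is always addressed as index 0. This is why the snippet above uses `cuda:0` and `device_ids=[0]` on every rank. The helper name `pin_process_to_gpu` and its `env` parameter are hypothetical, added only so the behavior can be checked without a GPU. The variable has to be set before the process initializes CUDA, otherwise it has no effect.

```python
import os


def pin_process_to_gpu(local_rank, env=None):
    """Expose only physical GPU `local_rank` to this process.

    Hypothetical helper for illustration. Returns the device string to use
    afterwards: with a single visible GPU, that GPU is index 0 in-process.
    """
    env = os.environ if env is None else env
    # Must happen before CUDA is initialized in this process.
    env["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    return "cuda:0"


# Example with a plain dict standing in for os.environ:
# the process with local rank 2 addresses physical GPU 2 as "cuda:0".
fake_env = {}
device = pin_process_to_gpu(2, env=fake_env)
print(fake_env["CUDA_VISIBLE_DEVICES"], device)
```

Because each process can only see its own GPU, it cannot create a CUDA context on physical GPU 0 by accident, which is one plausible reason the extra per-process memory on GPU 0 disappears.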

an99990 commented 3 years ago

hey @taeyeop-lee i just tried changing the lines like you did. I get the error that model is not defined for this

model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[0], output_device=0,
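
A "model is not defined" error is a Python NameError, which usually means the wrapping call runs before `model` has been assigned in that scope. The lines quoted above are a fragment: they assume that `args` and `model` already exist in the surrounding training script. The following is a minimal sketch of one ordering that satisfies this, assuming the script is started by a launcher that passes `--local_rank` or sets `LOCAL_RANK`. The names `resolve_local_rank` and `build_ddp_model`, and the `torch.nn.Linear` placeholder network, are hypothetical and not taken from the FFB6D code.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return the local rank from --local_rank, else LOCAL_RANK, else 0."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    # Ignore any other arguments the training script may define.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(env.get("LOCAL_RANK", 0))


def build_ddp_model(local_rank):
    """Sketch of the order of operations; needs a GPU and a launcher to run."""
    # 1. Hide all GPUs except this process's own, before CUDA is touched.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

    # Imported late as a conservative choice, so the mask is already set.
    import torch
    import torch.distributed as dist

    # 2. Join the process group; env:// reads the variables set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")

    # 3. Create the model FIRST (placeholder network, an assumption).
    device = torch.device("cuda:0")
    model = torch.nn.Linear(8, 2).to(device)

    # 4. Only now wrap it; `model` is defined at this point.
    return torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[0], output_device=0
    )
```

In an existing script, the equivalent change is to keep the wrapping call where the model is already constructed and only swap in the new device arguments, rather than pasting the fragment near the top of the file.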