taeyeopl opened this issue 3 years ago
I solved the problem using some tricks. If you can share any comments related to your experience, it would be really helpful.
https://github.com/ethnhe/FFB6D/blob/3579a69df27451f74f3abc06964bba1bc7d40605/ffb6d/train_lm.py#L109
I changed CUDA_VISIBLE_DEVICES and the related device settings:
os.environ['CUDA_VISIBLE_DEVICES'] = str(args.local_rank)
torch.cuda.set_device(0)
device = torch.device('cuda:{}'.format(0))
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[0], output_device=0,
    find_unused_parameters=True)
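For reference, the more common per-rank setup keeps all GPUs visible and binds each process to `args.local_rank` directly. Below is a minimal sketch of that approach (not the repo's exact code), assuming `local_rank` is the value the launcher passes to each process as in train_lm.py:

```python
# Minimal sketch (not the repo's exact code): bind each process to its own GPU
# via its local rank instead of remapping CUDA_VISIBLE_DEVICES.
import torch
import torch.distributed as dist
import torch.nn as nn

def setup_ddp(model: nn.Module, local_rank: int) -> nn.Module:
    torch.cuda.set_device(local_rank)          # all default CUDA calls go to this GPU
    dist.init_process_group(backend='nccl', init_method='env://')
    model = model.cuda(local_rank)             # move the network to this rank's GPU
    return nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank,
        find_unused_parameters=True)
```

Both approaches should end up equivalent, as long as the CUDA_VISIBLE_DEVICES remapping happens before any CUDA call: each process then sees exactly one GPU, and device index 0 refers to that GPU.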
Hey @taeyeop-lee, I just tried changing the lines like you did, but I get an error that model is not defined for this:
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[0], output_device=0,
    find_unused_parameters=True)
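That NameError usually just means the snippet was placed above the point where the network is actually created. The DDP wrapping has to come after the model object exists (and after the process group is initialized, which the script does before wrapping). A rough sketch of the ordering, with a tiny nn.Linear standing in for the real FFB6D network:

```python
# Sketch of the required ordering only; the nn.Linear is a stand-in for the
# network that train_lm.py builds, and the process group is assumed to be
# initialized already before the wrapper is created.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)            # 1) build the network first
model = model.cuda()               # 2) move it to this process's visible GPU
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[0], output_device=0,    # 3) only then wrap it
    find_unused_parameters=True)
```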
I have a simple question related to data parallelism and your code. I ran the provided train_lm.sh script: https://github.com/ethnhe/FFB6D/blob/master/ffb6d/train_lm.sh When I ran it, I got an out-of-memory error. My main problem is that GPU 0 uses more memory than the other GPUs.
Q1. Have you experienced similar issues, or is this phenomenon expected? I observed that the number of workers affects GPU 0's memory usage. Can you explain where the 147MiB is allocated in your code? https://github.com/ethnhe/FFB6D/blob/3579a69df27451f74f3abc06964bba1bc7d40605/ffb6d/train_lm.py#L571
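Not one of the authors, but an extra ~147MiB that shows up on GPU 0 in nvidia-smi is often just a bare CUDA context opened by another process (a non-zero rank, or a DataLoader worker) that touched cuda:0, rather than tensors from the model. A hedged way to check from inside each rank (a sketch, not the repo's code): if the allocator usage on cuda:0 is roughly zero for every non-zero rank but nvidia-smi still lists those processes on GPU 0, the usage is a context, not data.

```python
# Sketch only: report per-device allocator usage from each rank. A bare CUDA
# context created by merely touching a device does NOT appear here, only in
# nvidia-smi, which is how the two can be told apart.
import torch

def report_gpu_memory(rank: int) -> None:
    for d in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(d) / 2**20
        reserved = torch.cuda.memory_reserved(d) / 2**20
        print(f"rank {rank} | cuda:{d} | allocated={alloc:.0f} MiB | reserved={reserved:.0f} MiB")
```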
I think normalSpeed.depth_normal uses memory, and the amount depends on the number of workers. Is that right? Do you have a solution to distribute this memory allocation across the GPUs?
https://github.com/ethnhe/FFB6D/blob/3579a69df27451f74f3abc06964bba1bc7d40605/ffb6d/datasets/linemod/linemod_dataset.py#L252
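If anything in the Dataset (depth_normal or otherwise) ends up touching CUDA inside a worker process, each of the num_workers subprocesses would open its own context on the default device, which would explain GPU 0 growing with the worker count. One hedged option is to restrict device visibility per rank before the DataLoader is created, so the forked workers inherit it; a sketch under that assumption, with placeholder dataset and batch-size values that are not from the repo:

```python
# Sketch, assuming the per-rank CUDA_VISIBLE_DEVICES remapping from the first
# comment: if visibility is restricted before anything touches CUDA and before
# the DataLoader is created, every worker forked from this rank sees a single
# GPU, so a stray CUDA allocation inside the Dataset cannot land on GPU 0.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))      # or args.local_rank from the launcher
os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank)   # must precede any CUDA call

import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id: int) -> None:
    # Purely diagnostic: confirm which GPU each worker process can actually see.
    print("worker", worker_id, "sees CUDA_VISIBLE_DEVICES =",
          os.environ.get("CUDA_VISIBLE_DEVICES"))

train_ds = TensorDataset(torch.zeros(16, 3))            # placeholder dataset, not the repo's
train_loader = DataLoader(train_ds, batch_size=4,
                          num_workers=4, worker_init_fn=worker_init_fn)
```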
My current GPU is a TITAN Xp.