taeyeopl opened this issue 3 years ago
I solved the problem using some tricks. If you can share any comments related to your experience, it would be really helpful.
https://github.com/ethnhe/FFB6D/blob/3579a69df27451f74f3abc06964bba1bc7d40605/ffb6d/train_lm.py#L109
I changed CUDA_VISIBLE_DEVICES and the related device settings:
os.environ['CUDA_VISIBLE_DEVICES'] = str(args.local_rank)
torch.cuda.set_device(0)
device = torch.device('cuda:{}'.format(0))
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[0], output_device=0,
    find_unused_parameters=True)
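For reference, the more common per-rank setup keeps all GPUs visible and binds each process to `args.local_rank` directly. Below is a minimal sketch of that approach (not the repo's exact code), assuming `local_rank` is the value the launcher passes to each process as in train_lm.py:

```python
# Minimal sketch (not the repo's exact code): bind each process to its own GPU
# via its local rank instead of remapping CUDA_VISIBLE_DEVICES.
import torch
import torch.distributed as dist
import torch.nn as nn

def setup_ddp(model: nn.Module, local_rank: int) -> nn.Module:
    torch.cuda.set_device(local_rank)          # all default CUDA calls go to this GPU
    dist.init_process_group(backend='nccl', init_method='env://')
    model = model.cuda(local_rank)             # move the network to this rank's GPU
    return nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank,
        find_unused_parameters=True)
```

Both approaches should end up equivalent, as long as the CUDA_VISIBLE_DEVICES remapping happens before any CUDA call: each process then sees exactly one GPU, and device index 0 refers to that GPU.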
Hey @taeyeop-lee, I just tried changing the lines like you did, but I get an error that model is not defined for this:
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[0], output_device=0,
    find_unused_parameters=True)
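That NameError usually just means the snippet was placed above the point where the network is actually created. The DDP wrapping has to come after the model object exists (and after the process group is initialized, which the script does before wrapping). A rough sketch of the ordering, with a tiny nn.Linear standing in for the real FFB6D network:

```python
# Sketch of the required ordering only; the nn.Linear is a stand-in for the
# network that train_lm.py builds, and the process group is assumed to be
# initialized already before the wrapper is created.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)            # 1) build the network first
model = model.cuda()               # 2) move it to this process's visible GPU
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[0], output_device=0,    # 3) only then wrap it
    find_unused_parameters=True)
```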
I have a simple question related to data parallelism and your code. I ran the provided train_lm.sh script: https://github.com/ethnhe/FFB6D/blob/master/ffb6d/train_lm.sh When I ran it, I got an out-of-memory error. My main problem is that GPU 0 uses more memory than the other GPUs.
Q1. Have you experienced similar issues, or is this phenomenon expected? I observed that the number of workers affects GPU 0's memory usage. Can you explain where the 147MiB is allocated in your code? https://github.com/ethnhe/FFB6D/blob/3579a69df27451f74f3abc06964bba1bc7d40605/ffb6d/train_lm.py#L571
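Not one of the authors, but an extra ~147MiB that shows up on GPU 0 in nvidia-smi is often just a bare CUDA context opened by another process (a non-zero rank, or a DataLoader worker) that touched cuda:0, rather than tensors from the model. A hedged way to check from inside each rank (a sketch, not the repo's code): if the allocator usage on cuda:0 is roughly zero for every non-zero rank but nvidia-smi still lists those processes on GPU 0, the usage is a context, not data.

```python
# Sketch only: report per-device allocator usage from each rank. A bare CUDA
# context created by merely touching a device does NOT appear here, only in
# nvidia-smi, which is how the two can be told apart.
import torch

def report_gpu_memory(rank: int) -> None:
    for d in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(d) / 2**20
        reserved = torch.cuda.memory_reserved(d) / 2**20
        print(f"rank {rank} | cuda:{d} | allocated={alloc:.0f} MiB | reserved={reserved:.0f} MiB")
```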
I think normalSpeed.depth_normal uses memory, and the amount depends on the number of workers. Is that right? Do you have a solution to distribute this memory allocation across the GPUs?
https://github.com/ethnhe/FFB6D/blob/3579a69df27451f74f3abc06964bba1bc7d40605/ffb6d/datasets/linemod/linemod_dataset.py#L252
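If anything in the Dataset (depth_normal or otherwise) ends up touching CUDA inside a worker process, each of the num_workers subprocesses would open its own context on the default device, which would explain GPU 0 growing with the worker count. One hedged option is to restrict device visibility per rank before the DataLoader is created, so the forked workers inherit it; a sketch under that assumption, with placeholder dataset and batch-size values that are not from the repo:

```python
# Sketch, assuming the per-rank CUDA_VISIBLE_DEVICES remapping from the first
# comment: if visibility is restricted before anything touches CUDA and before
# the DataLoader is created, every worker forked from this rank sees a single
# GPU, so a stray CUDA allocation inside the Dataset cannot land on GPU 0.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))      # or args.local_rank from the launcher
os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank)   # must precede any CUDA call

import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id: int) -> None:
    # Purely diagnostic: confirm which GPU each worker process can actually see.
    print("worker", worker_id, "sees CUDA_VISIBLE_DEVICES =",
          os.environ.get("CUDA_VISIBLE_DEVICES"))

train_ds = TensorDataset(torch.zeros(16, 3))            # placeholder dataset, not the repo's
train_loader = DataLoader(train_ds, batch_size=4,
                          num_workers=4, worker_init_fn=worker_init_fn)
```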
My current GPU is a TITAN Xp.