@foreverYoungGitHub @ShuangXieIrene can you please provide your feedback on this matter? Thanks
@isalirezag I have met the same issue. Have you solved it?
@VisionZQ I am afraid not
you can refer to this https://github.com/lzx1413/PytorchSSD/blob/0.4/ssds_train.py @isalirezag @VisionZQ
I had a similar question. Any ideas?
Is there something wrong with doing the following?
```python
self.use_gpu = torch.cuda.is_available()
if self.use_gpu:
    print('Utilize GPUs for computation')
    print('Number of GPU available', torch.cuda.device_count())
    self.model.cuda()
    # .cuda() is not in-place for tensors, so the result must be reassigned
    self.priors = self.priors.cuda()
    cudnn.benchmark = True
    if torch.cuda.device_count() > 1:
        device_ids = list(range(torch.cuda.device_count()))
        # self.model = torch.nn.DataParallel(self.model).module
        orig_model = self.model
        self.model = torch.nn.DataParallel(self.model, device_ids=device_ids)
        # re-expose the trainable-scope submodules on the wrapper
        for module in cfg.TRAIN.TRAINABLE_SCOPE.split(','):
            setattr(self.model, module, getattr(orig_model, module))
```
When I originally called `torch.nn.DataParallel(self.model, device_ids=device_ids)`, `trainable_param = self.trainable_param(cfg.TRAIN.TRAINABLE_SCOPE)` returned an empty parameter list. This is because the trainable-param attributes (`base`, `norm`, `extras`, `loc`, `conf`) are no longer attributes of the model once it is wrapped in `torch.nn.DataParallel`. So I went ahead and assigned the fields from `cfg.TRAIN.TRAINABLE_SCOPE` to the newly wrapped model.
I have something running that appears to be getting a reasonable speedup, and the loss is making reasonable progress. I wanted to see if someone could spot anything here that would lead to incorrect behavior.
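For reference, a minimal sketch of the underlying behavior (toy names, not from ssds.pytorch): `DataParallel` does not forward attribute lookups to the wrapped network, so submodules registered on the original model are only reachable through `.module`.

```python
import torch.nn as nn

# Toy model with named submodules, mirroring the base/norm/extras/loc/conf layout.
class ToyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 4)
        self.loc = nn.Linear(4, 2)

model = ToyDetector()
wrapped = nn.DataParallel(model)

print(hasattr(wrapped, 'base'))         # False: the wrapper has no 'base' attribute
print(hasattr(wrapped.module, 'base'))  # True: the original model sits at wrapped.module
```

An alternative to copying the attributes onto the wrapper would be to have `trainable_param` look them up on `self.model.module` whenever the model is wrapped.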
I use torch.nn.DataParallel as follows:

- Replace

```python
self.model = torch.nn.DataParallel(self.model).module
```

with

```python
self.para_model = torch.nn.DataParallel(self.model)
self.model = self.para_model.module
```

- Replace

```python
self.train_epoch(self.model, self.train_loader, self.optimizer, self.criterion, self.writer, epoch, self.use_gpu)
```

with

```python
self.train_epoch(self.para_model, self.train_loader, self.optimizer, self.criterion, self.writer, epoch, self.use_gpu)
```
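The point of keeping both handles, as a self-contained sketch (the `nn.Linear` stand-in, `images`, and `checkpoint.pth` are illustrative): the forward pass must go through the `DataParallel` wrapper so each batch is scattered across GPUs, while attribute access and checkpointing are done through the unwrapped `.module`, whose `state_dict` keys carry no `module.` prefix.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                      # stand-in for the detector
if torch.cuda.is_available():
    model = model.cuda()                     # DataParallel expects params on device_ids[0]

para_model = nn.DataParallel(model)          # wrapper: scatters each batch across GPUs
model = para_model.module                    # unwrapped net: attribute access and saving

images = torch.randn(4, 8)                   # dummy batch
out = para_model(images)                     # forward goes through the wrapper
torch.save(model.state_dict(), 'checkpoint.pth')  # keys carry no 'module.' prefix
```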
@CrawlingD I replaced the code following your answer, but I get the following error:
```
Traceback (most recent call last):
  File "train.py", line 40, in <module>
```
@wilxy Maybe it's because your batch size is not divisible by the number of GPUs.
@CrawlingD

```
Trainable scope: base,norm,extras,loc,conf
Loading initial model weights from ./weights/rfb/resnet50_rfb_voc_81.2.pth
=> no checkpoint found at './weights/rfb/resnet50_rfb_voc_81.2.pth'
Epoch 1/100:
/home/huangfu/github/ssds.pytorch/lib/ssds_train.py:281: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  targets = [Variable(anno.cuda(), volatile=True) for anno in targets]
/home/huangfu/anaconda3/envs/ssds-pytorch/lib/python3.6/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train.py", line 44, in <module>
    train()
  File "train.py", line 41, in train
    train_model()
  File "/home/huangfu/github/ssds.pytorch/lib/ssds_train.py", line 602, in train_model
    s.train_model()
  File "/home/huangfu/github/ssds.pytorch/lib/ssds_train.py", line 232, in train_model
    self.train_epoch(self.model, self.train_loader, self.optimizer, self.criterion, self.writer, epoch, self.use_gpu)
  File "/home/huangfu/github/ssds.pytorch/lib/ssds_train.py", line 291, in train_epoch
    loss_l, loss_c = criterion(out, targets)
  File "/home/huangfu/anaconda3/envs/ssds-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/huangfu/github/ssds.pytorch/lib/layers/modules/multibox_loss.py", line 91, in forward
    loss_c[pos] = 0  # filter out pos boxes for now
IndexError: The shape of the mask [2, 11620] at index 0 does not match the shape of the indexed tensor [23240, 1] at index 0
```

What should I do?
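Not a definitive fix, but the shapes in that traceback line up suspiciously: 2 × 11620 = 23240, i.e. `loss_c` is still flattened to `[num * num_priors, 1]` when the `[num, num_priors]` mask `pos` indexes it. A commonly cited workaround for ssd.pytorch-style `MultiBoxLoss` on newer PyTorch versions is to reshape `loss_c` before masking; a minimal sketch, assuming the usual `num`/`pos`/`loss_c` variable names in `multibox_loss.py`:

```python
import torch

num, num_priors = 2, 11620
loss_c = torch.randn(num * num_priors, 1)       # flattened, as computed in the loss
pos = torch.zeros(num, num_priors, dtype=torch.bool)

# loss_c[pos] = 0   # IndexError: mask [2, 11620] vs tensor [23240, 1]

loss_c = loss_c.view(num, -1)   # reshape first so mask and tensor agree
loss_c[pos] = 0                 # filter out pos boxes for now
```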
I uncommented the lines that have torch.nn.DataParallel, but the model still cannot run in parallel. It runs, but it still uses just one GPU. Any suggestions for how to solve this?
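One quick sanity check (a generic sketch, nothing ssds.pytorch-specific): confirm that PyTorch actually sees more than one device before the model is wrapped, since `DataParallel` runs on a single GPU without any warning when only one device is visible.

```python
import os
import torch

# If CUDA_VISIBLE_DEVICES is set to a single index, DataParallel can only use that GPU.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('device_count =', torch.cuda.device_count())  # must be > 1 for a real split
```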