ShuangXieIrene / ssds.pytorch

Repository for Single Shot MultiBox Detector and its variants, implemented with pytorch, python3.
MIT License

not able to use torch.nn.DataParallel #26

Closed isalirezag closed 3 years ago

isalirezag commented 6 years ago

I uncommented the lines that use torch.nn.DataParallel, but the model still does not run in parallel. It runs, but it still uses just one GPU. Any suggestions on how to solve this?

isalirezag commented 5 years ago

@foreverYoungGitHub @ShuangXieIrene can you please provide your feedback on this matter? Thanks

VisionZQ commented 5 years ago

@isalirezag I have run into the same issue. Have you solved it?

isalirezag commented 5 years ago

@VisionZQ I am afraid not

lzx1413 commented 5 years ago

You can refer to this: https://github.com/lzx1413/PytorchSSD/blob/0.4/ssds_train.py @isalirezag @VisionZQ

tzrtzr000 commented 5 years ago

I have a similar question. Any ideas?

giulio-zhou commented 5 years ago

Is there something wrong with doing the following?

        self.use_gpu = torch.cuda.is_available()
        if self.use_gpu:
            print('Utilize GPUs for computation')
            print('Number of GPU available', torch.cuda.device_count())
            self.model.cuda()
            # Tensor.cuda() returns a copy, so the result has to be assigned back.
            self.priors = self.priors.cuda()
            cudnn.benchmark = True
            if torch.cuda.device_count() > 1:
                device_ids = list(range(torch.cuda.device_count()))
                # self.model = torch.nn.DataParallel(self.model).module
                orig_model = self.model
                # Keep the wrapper itself so forward passes are split across GPUs.
                self.model = torch.nn.DataParallel(self.model, device_ids=device_ids)
                # Re-expose the scoped sub-modules (base, norm, extras, loc, conf) on the wrapper.
                for module in cfg.TRAIN.TRAINABLE_SCOPE.split(','):
                    setattr(self.model, module, getattr(orig_model, module))

When I originally called torch.nn.DataParallel(self.model, device_ids=device_ids), trainable_param = self.trainable_param(cfg.TRAIN.TRAINABLE_SCOPE) returned an empty parameter list. This is because the trainable param attributes (base, norm, extras, loc, conf) are attributes of the original model, not of the torch.nn.DataParallel wrapper around it. So I assigned the fields named in cfg.TRAIN.TRAINABLE_SCOPE to the newly wrapped model. I now have something running that appears to get a reasonable speedup, and the loss is making reasonable progress. I wanted to see if anyone can spot anything here that would lead to incorrect behavior.
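
For reference, a minimal sketch of why the scoped attributes disappear after wrapping, and of one way to collect the trainable parameters without copying attributes onto the wrapper. `TinySSD` and `trainable_params` are hypothetical stand-ins for the repository's model and its `trainable_param` helper; only the `.module` attribute of `torch.nn.DataParallel` comes from PyTorch itself.

```python
import torch
import torch.nn as nn


class TinySSD(nn.Module):
    """Hypothetical stand-in for the detector; only the attribute names matter."""

    def __init__(self):
        super().__init__()
        self.base = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.loc = nn.Conv2d(4, 4, kernel_size=3, padding=1)
        self.conf = nn.Conv2d(4, 2, kernel_size=3, padding=1)

    def forward(self, x):
        feat = self.base(x)
        return self.loc(feat), self.conf(feat)


def trainable_params(model, scope):
    """Collect parameters of the sub-modules named in a comma-separated scope."""
    # DataParallel keeps the original network under `.module`, so unwrap first.
    core = model.module if isinstance(model, nn.DataParallel) else model
    params = []
    for name in scope.split(','):
        if hasattr(core, name):
            params.extend(getattr(core, name).parameters())
    return params


model = TinySSD()
wrapped = nn.DataParallel(model)

# The wrapper itself does not expose `base`, `loc`, or `conf` ...
print(hasattr(wrapped, 'base'))         # False
# ... but the wrapped network still does.
print(hasattr(wrapped.module, 'base'))  # True
# Three conv layers, each with a weight and a bias.
print(len(trainable_params(wrapped, 'base,loc,conf')))  # 6
```

Reading the parameters through `.module` leaves the wrapper's own attribute table untouched, which may be simpler to reason about than registering the same sub-modules in two places.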

CrawlingD commented 5 years ago

I use torch.nn.DataParallel as follows:

  1. Replace
         self.model = torch.nn.DataParallel(self.model).module
     with
         self.para_model = torch.nn.DataParallel(self.model)
         self.model = self.para_model.module
  2. Replace
         self.train_epoch(self.model, self.train_loader, self.optimizer, self.criterion, self.writer, epoch, self.use_gpu)
     with
         self.train_epoch(self.para_model, self.train_loader, self.optimizer, self.criterion, self.writer, epoch, self.use_gpu)
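
The two steps above can be sketched as follows. Wrapping a model and immediately taking `.module` hands back the original network, which would explain why the stock line never uses more than one device; keeping the wrapper and calling it for the forward pass is what lets the batch be split across GPUs. `Net` is a hypothetical placeholder for the detector, and `para_model` follows the naming used above.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical placeholder model."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)


model = Net()

# Original pattern: the wrapper is discarded at once, so this is just `model`.
unwrapped = nn.DataParallel(model).module
print(unwrapped is model)  # True

# Suggested pattern: keep the wrapper for forward passes ...
para_model = nn.DataParallel(model)
# ... and keep a handle to the underlying network for checkpoints and
# attribute access; both share the same parameters.
model = para_model.module

if torch.cuda.is_available():
    para_model = para_model.cuda()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch = torch.randn(4, 8, device=device)
out = para_model(batch)  # split across the visible GPUs when more than one exists
print(out.shape)  # torch.Size([4, 2])
```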

wilxy commented 5 years ago

@CrawlingD I replaced the code following your answer, but I get the following error:

    Traceback (most recent call last):
      File "train.py", line 40, in <module>
        train()
      File "train.py", line 37, in train
        train_model()
      File "/home/zzy/yolo_mobile/yolo_mobilenet_pytorch/lib/ssds_train.py", line 608, in train_model
        s.train_model()
      File "/home/zzy/yolo_mobile/yolo_mobilenet_pytorch/lib/ssds_train.py", line 237, in train_model
        self.use_gpu)
      File "/home/zzy/yolo_mobile/yolo_mobilenet_pytorch/lib/ssds_train.py", line 293, in train_epoch
        out = model(images, phase='train')
      File "/home/zzy/anaconda3/envs/ssds/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zzy/anaconda3/envs/ssds/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 73, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/home/zzy/anaconda3/envs/ssds/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 83, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/home/zzy/anaconda3/envs/ssds/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
        raise output
      File "/home/zzy/anaconda3/envs/ssds/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
        output = module(*input, **kwargs)
      File "/home/zzy/anaconda3/envs/ssds/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
        result = self.forward(*input, **kwargs)
    TypeError: forward() missing 1 required positional argument: 'x'

CrawlingD commented 5 years ago

@wilxy Maybe it's because your batch size is not divisible by the number of GPUs.
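
A small sketch of the arithmetic behind that suggestion, assuming the scatter step divides the batch the way `torch.chunk` does (equal chunks of size ceil(batch / gpus), with the last chunk possibly smaller). Under that assumption a batch can yield fewer chunks than there are GPUs, and a replica left without an input chunk could end up being called with no positional argument, which would match the `missing 1 required positional argument: 'x'` message. `chunk_sizes` is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Sizes of the pieces a batch is cut into, mimicking torch.chunk semantics."""
    if batch_size == 0:
        return []
    step = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(step, remaining))
        remaining -= step
    return sizes


print(chunk_sizes(32, 4))  # [8, 8, 8, 8]
print(chunk_sizes(5, 4))   # [2, 2, 1]  -> only three of four GPUs get data
print(chunk_sizes(1, 2))   # [1]        -> the second GPU gets nothing
```

If the short batch is the last one of an epoch, passing `drop_last=True` to the `DataLoader` is one common way to avoid it.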

Damon2019 commented 4 years ago

@CrawlingD

    Trainable scope: base,norm,extras,loc,conf
    Loading initial model weights from ./weights/rfb/resnet50_rfb_voc_81.2.pth
    => no checkpoint found at './weights/rfb/resnet50_rfb_voc_81.2.pth'
    Epoch 1/100:
    /home/huangfu/github/ssds.pytorch/lib/ssds_train.py:281: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
      targets = [Variable(anno.cuda(), volatile=True) for anno in targets]
    /home/huangfu/anaconda3/envs/ssds-pytorch/lib/python3.6/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
      warnings.warn(warning.format(ret))
    Traceback (most recent call last):
      File "train.py", line 44, in <module>
        train()
      File "train.py", line 41, in train
        train_model()
      File "/home/huangfu/github/ssds.pytorch/lib/ssds_train.py", line 602, in train_model
        s.train_model()
      File "/home/huangfu/github/ssds.pytorch/lib/ssds_train.py", line 232, in train_model
        self.train_epoch(self.model, self.train_loader, self.optimizer, self.criterion, self.writer, epoch, self.use_gpu)
      File "/home/huangfu/github/ssds.pytorch/lib/ssds_train.py", line 291, in train_epoch
        loss_l, loss_c = criterion(out, targets)
      File "/home/huangfu/anaconda3/envs/ssds-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/huangfu/github/ssds.pytorch/lib/layers/modules/multibox_loss.py", line 91, in forward
        loss_c[pos] = 0 # filter out pos boxes for now
    IndexError: The shape of the mask [2, 11620] at index 0 does not match the shape of the indexed tensor [23240, 1] at index 0

What should I do?
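
One observation on the shapes in that message: 2 × 11620 = 23240, so `loss_c` appears to still be flattened to (batch × priors, 1) while the mask `pos` has shape (batch, priors). A hedged sketch of the mismatch, and of reshaping before masking, follows. The sizes are shrunk, the values are random, and whether this matches the surrounding code in `multibox_loss.py` is an assumption.

```python
import torch

num, num_priors = 2, 5                      # stand-ins for batch size 2 and 11620 priors
loss_c = torch.rand(num * num_priors, 1)    # flattened per-prior loss, shape [10, 1]
pos = torch.zeros(num, num_priors, dtype=torch.bool)
pos[0, 1] = True                            # pretend one prior is a positive match

try:
    loss_c[pos] = 0                         # mask [2, 5] vs tensor [10, 1]
    mismatch = False
except IndexError:
    mismatch = True
print(mismatch)  # True

# Give the loss the same layout as the mask, then zero out the positives.
loss_c = loss_c.view(num, -1)               # shape [2, 5]
loss_c[pos] = 0
print(loss_c.shape)  # torch.Size([2, 5])
```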