Since we support pytorch-lightning you can use that: just pass the argument gpus=<number> to pl.Trainer.
I've never tried it myself, so some errors might pop up. Would you like to try and report back what you find?
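For reference, a minimal, untested sketch of what that could look like (light_model, train_dl and valid_dl being the objects built in quickstart.ipynb):

import pytorch_lightning as pl

# gpus=2 requests two GPUs; gpus=-1 would use every visible GPU
trainer = pl.Trainer(max_epochs=50, gpus=2)
trainer.fit(light_model, train_dl, valid_dl)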
So far I have been trying to get multi-GPU training working with a fastai model. Here is my code:
from icevision.all import *  # imports assumed by the snippet below

model = efficientdet.model('tf_efficientdet_lite0', num_classes=len(class_map), img_size=size)
metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]
learn = efficientdet.fastai.learner(dls=[train_dl, valid_dl], model=model, metrics=metrics, opt_func=ranger)
learn.to_parallel()  # fastai's DataParallel wrapper
An error message pops up when training starts:
raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([4, 128, 128, 27])) must be the same as input size (torch.Size([2, 128, 128, 27]))
I have 2 GPUs and my batch size is set to 4. My guess is that the error happens because the batch is divided equally across the GPUs, so each GPU gets 2 images, which no longer tallies with the original batch size of 4 used to build the targets.
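To illustrate that guess: nn.DataParallel scatters positional tensor arguments along dim 0 across the replicas, while anything it does not scatter keeps the full batch size. A rough sketch of the splitting (needs two visible GPUs to actually run):

import torch
from torch.nn.parallel import scatter

images = torch.zeros(4, 3, 512, 512)          # batch of 4, as built by the dataloader
chunks = scatter(images, target_gpus=[0, 1])  # what DataParallel does to tensor inputs
print([c.shape[0] for c in chunks])           # [2, 2] -> each replica sees a batch of 2
# If the targets reach the loss without being split the same way, each replica
# ends up comparing outputs for 2 images against targets sized for all 4.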
Following your suggestion above on using pl.Trainer, I tried replacing the line in quickstart.ipynb with the following:
trainer = pl.Trainer(max_epochs=50, gpus=-1, distributed_backend="dp")
A similar error pops up:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type          | Params
----------------------------------------
0 | model | DetBenchTrain | 3 M
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-f103b8e56be0> in <module>
1 # trainer = pl.Trainer(max_epochs=50, gpus=1)
2 trainer = pl.Trainer(max_epochs=50, gpus=-1, distributed_backend="dp")
----> 3 trainer.fit(light_model, train_dl, valid_dl)
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
442 self.call_hook('on_fit_start')
443
--> 444 results = self.accelerator_backend.train()
445 self.accelerator_backend.teardown()
446
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in train(self)
104
105 # train or test
--> 106 results = self.train_or_test()
107
108 return results
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
72 results = self.trainer.run_test()
73 else:
---> 74 results = self.trainer.train()
75 return results
76
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
464
465 def train(self):
--> 466 self.run_sanity_check(self.get_model())
467
468 self.checkpoint_connector.has_trained = False
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in run_sanity_check(self, ref_model)
656
657 # run eval step
--> 658 _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
659
660 # allow no returns from eval
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in run_evaluation(self, test_mode, max_batches)
576
577 # lightning module methods
--> 578 output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
579 output = self.evaluation_loop.evaluation_step_end(output)
580
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py in evaluation_step(self, test_mode, batch, batch_idx, dataloader_idx)
169 output = self.trainer.accelerator_backend.test_step(args)
170 else:
--> 171 output = self.trainer.accelerator_backend.validation_step(args)
172
173 # track batch size for weighted average
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in validation_step(self, args)
122
123 def validation_step(self, args):
--> 124 output = self.training_step(args)
125 return output
126
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in training_step(self, args)
118 output = self.trainer.model(*args)
119 else:
--> 120 output = self.trainer.model(*args)
121 return output
122
~/anaconda3/envs/aceic/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in forward(self, *inputs, **kwargs)
85
86 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
---> 87 outputs = self.parallel_apply(replicas, inputs, kwargs)
88
89 if isinstance(outputs[0], Result):
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
149
150 def parallel_apply(self, replicas, inputs, kwargs):
--> 151 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
152
153
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(modules, inputs, kwargs_tup, devices)
308 output = results[i]
309 if isinstance(output, Exception):
--> 310 raise output
311 outputs.append(output)
312 return outputs
~/anaconda3/envs/aceic/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in _worker(i, module, input, kwargs, device)
267 fx_called = 'test_step'
268 else:
--> 269 output = module.validation_step(*input, **kwargs)
270 fx_called = 'validation_step'
271
~/anaconda3/envs/aceic/lib/python3.8/site-packages/icevision/models/efficientdet/lightning/model_adapter.py in validation_step(self, batch, batch_idx)
42
43 with torch.no_grad():
---> 44 raw_preds = self(xb, yb)
45 preds = efficientdet.convert_raw_predictions(raw_preds["detections"], 0)
46 loss = efficientdet.loss_fn(raw_preds, yb)
~/anaconda3/envs/aceic/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/aceic/lib/python3.8/site-packages/icevision/models/efficientdet/lightning/model_adapter.py in forward(self, *args, **kwargs)
27
28 def forward(self, *args, **kwargs):
---> 29 return self.model(*args, **kwargs)
30
31 def training_step(self, batch, batch_idx):
~/anaconda3/envs/aceic/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/aceic/lib/python3.8/site-packages/effdet/bench.py in forward(self, x, target)
119 target['bbox'], target['cls'])
120
--> 121 loss, class_loss, box_loss = self.loss_fn(class_out, box_out, cls_targets, box_targets, num_positives)
122 output = {'loss': loss, 'class_loss': class_loss, 'box_loss': box_loss}
123 if not self.training:
~/anaconda3/envs/aceic/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/aceic/lib/python3.8/site-packages/effdet/loss.py in forward(self, cls_outputs, box_outputs, cls_targets, box_targets, num_positives)
254 l_fn = loss_jit
255
--> 256 return l_fn(
257 cls_outputs, box_outputs, cls_targets, box_targets, num_positives,
258 num_classes=self.num_classes, alpha=self.alpha, gamma=self.gamma, delta=self.delta,
~/anaconda3/envs/aceic/lib/python3.8/site-packages/effdet/loss.py in loss_fn(cls_outputs, box_outputs, cls_targets, box_targets, num_positives, num_classes, alpha, gamma, delta, box_loss_weight, label_smoothing, new_focal)
201 alpha=alpha, gamma=gamma, normalizer=num_positives_sum, label_smoothing=label_smoothing)
202 else:
--> 203 cls_loss = focal_loss_legacy(
204 cls_outputs_at_level, cls_targets_at_level_oh,
205 alpha=alpha, gamma=gamma, normalizer=num_positives_sum)
~/anaconda3/envs/aceic/lib/python3.8/site-packages/effdet/loss.py in focal_loss_legacy(logits, targets, alpha, gamma, normalizer)
39 """
40 positive_label_mask = targets == 1.0
---> 41 cross_entropy = F.binary_cross_entropy_with_logits(logits, targets.to(logits.dtype), reduction='none')
42 neg_logits = -1.0 * logits
43 modulator = torch.exp(gamma * targets * neg_logits - gamma * torch.log1p(torch.exp(neg_logits)))
~/anaconda3/envs/aceic/lib/python3.8/site-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
2578
2579 if not (target.size() == input.size()):
-> 2580 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
2581
2582 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([16, 48, 48, 45])) must be the same as input size (torch.Size([8, 48, 48, 45]))
Hi, I experimented a bit with multi-GPU training using both fastai and pytorch lightning. Concerning fastai, it is advisable to switch to the newer dependency fastai==2.2.2, where the scripting module is incorporated into the main library. I was able to configure my script and launch it with
python -m fastai.launch --gpus 0,1 my_script.py
but then the following problem occurs:
File "/home/toucan/anaconda3/envs/icevision/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 73, in default
_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/home/toucan/anaconda3/envs/icevision/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictco
mp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/home/toucan/anaconda3/envs/icevision/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 83, in default
_collate
return [default_collate(samples) for samples in transposed]
File "/home/toucan/anaconda3/envs/icevision/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 83, in <listco
mp>
return [default_collate(samples) for samples in transposed]
File "/home/toucan/anaconda3/envs/icevision/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 85, in default
_collate
raise TypeError(default_collate_err_msg_format.format(elem_type))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'icevision.core.bbox.$
Box'>
I managed to fix the error with the following patch, which overrides fastai's DataLoader.create_batch with icevision's own batch builder:

@patch
def create_batch(self: DataLoader, b):
    return efficientdet.dataloaders.build_train_batch(b)
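For context, this is roughly how the whole script could look; a sketch under the assumption that learn is built as in the earlier snippet (distrib_ctx is fastai's distributed-training context manager):

# my_script.py -- launched with: python -m fastai.launch --gpus 0,1 my_script.py
from fastai.vision.all import *
from fastai.distributed import *
from icevision.all import *

# ... build train_dl, valid_dl, model and learn as in the earlier snippet ...

@patch
def create_batch(self: DataLoader, b):
    return efficientdet.dataloaders.build_train_batch(b)

with learn.distrib_ctx():  # wraps the learner in DistributedDataParallel
    learn.fine_tune(10)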
When running on pytorch lightning I tried the DDP accelerator, trainer = pl.Trainer(accelerator='ddp', ...), but then I encountered a memory leak in the validation step, described here: https://github.com/PyTorchLightning/pytorch-lightning/issues/2352, which I also wasn't able to resolve.
I can confirm that multi-GPU training works with pytorch lightning using the DDP accelerator:
trainer = pl.Trainer(max_epochs=10, gpus=[0,1], accelerator='ddp')
The memory leak was coming from COCOMetric accumulation. I will post an update on fastai distributed training.
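Since DDP spawns one process per GPU, this has to run as a script rather than in a notebook; a minimal sketch, assuming light_model, train_dl and valid_dl are built as in quickstart.ipynb:

# train_ddp.py -- run with: python train_ddp.py
import pytorch_lightning as pl

# ... build train_dl, valid_dl and light_model as in quickstart.ipynb ...

trainer = pl.Trainer(max_epochs=10, gpus=[0, 1], accelerator='ddp')
trainer.fit(light_model, train_dl, valid_dl)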
> The memory leak was coming from COCOMetric accumulation.
This is such excellent news :heart:
@lgvaz could you assign this one to me? I should remember to make some tutorials on how to do it :D
Here ya go @potipot !!! Thanks for the initiative!
Resolved.
Is it possible to use a multi-GPU setup with fastai inside a Jupyter notebook itself?
It is possible, but I wouldn't recommend it. AFAIR the only supported multi-GPU paradigm that works in a Jupyter notebook is DataParallel (DP in Pytorch Lightning). This is usually suboptimal and inferior to other paradigms such as DDP or DDP2, which however require launching from scripts; see https://pytorch.org/tutorials/intermediate/ddp_tutorial.html.
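If you do want to try it in a notebook anyway, a rough sketch of the DP route (note that the DP attempt earlier in this thread hit a target-size error, so the same caveat may apply):

import pytorch_lightning as pl

# 'dp' keeps everything in a single process, so it works inside Jupyter,
# but it splits each batch across GPUs as discussed earlier in the thread
trainer = pl.Trainer(max_epochs=10, gpus=2, accelerator='dp')
trainer.fit(light_model, train_dl, valid_dl)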
Is there a method to train efficientdet models on a multi-GPU setup?