Multi GPU training #581

Closed: dnth closed this issue 2 years ago

dnth commented 3 years ago

Is there a way to train EfficientDet models on a multi-GPU setup?

lgvaz commented 3 years ago

Since we support pytorch-lightning, you can use that: just pass the argument gpus=<number> to pl.Trainer.

I haven't tried that myself, so some errors might pop up. Would you like to try it and report back what you find?

dnth commented 3 years ago

So far I have been trying to get multi-GPU training working with a fastai model. Here is my code:

model = efficientdet.model('tf_efficientdet_lite0', num_classes=len(class_map), img_size=size)
metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]
learn = efficientdet.fastai.learner(dls=[train_dl, valid_dl], model=model, metrics=metrics, opt_func=ranger)

An error message pops up when training starts:

raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([4, 128, 128, 27])) must be the same as input size (torch.Size([2, 128, 128, 27]))

I have 2 GPUs. My batch size is set to 4.

My guess is that the error occurs because the batch is divided equally between the GPUs: each GPU gets 2 images, which doesn't match the original batch size of 4.
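The guess above can be sketched in plain Python without any GPU. This is only an illustration of ceil-division batch chunking and of the shape check that produced the error; `scatter_sizes` and `check_loss_shapes` are hypothetical helper names, not functions from icevision, fastai, or PyTorch.

```python
import math


def scatter_sizes(batch_size, n_devices):
    """Split a batch into per-device chunk sizes using ceil division."""
    chunk = math.ceil(batch_size / n_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def check_loss_shapes(input_shape, target_shape):
    """Raise the same kind of error as in the traceback when shapes differ."""
    if input_shape != target_shape:
        raise ValueError(
            "Target size ({}) must be the same as input size ({})".format(
                target_shape, input_shape
            )
        )


# Shapes taken from the traceback: the target was built for the full batch of 4.
full_target = (4, 128, 128, 27)

# With 2 GPUs, each replica only sees its own chunk of the inputs.
per_gpu = scatter_sizes(4, 2)
replica_input = (per_gpu[0],) + full_target[1:]

try:
    check_loss_shapes(replica_input, full_target)
except ValueError as err:
    print(err)
```

If the targets were split the same way as the inputs, both shapes would start with 2 and the check would pass, which is consistent with the guess that only one side of the batch is being divided.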

dnth commented 3 years ago

Following your suggestion above to use pl.Trainer, I replaced the line in quickstart.ipynb with the following:

trainer = pl.Trainer(max_epochs=50, gpus=-1, distributed_backend="dp")

Similar error pops up.

  | Name  | Type          | Params
HBox(children=(HTML(value='Validation sanity check'), FloatProgress(value=1.0, bar_style='info', layout=Layout…

potipot commented 3 years ago

Hi, I experimented a bit with multi-GPU training using both fastai and PyTorch Lightning. For fastai, it is advisable to switch to the newer dependency fastai==2.2.2, where the scripting module is incorporated into the main library. I was able to configure my script and launch it with

python -m fastai.launch --gpus 0,1

but here the problem that occurs is:

I managed to fix the error with the following patch:

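@patch  # fastcore decorator (assumed here): attaches the method to DataLoader via the self annotation
def create_batch(self:DataLoader, b):
    # build the batch with icevision's EfficientDet batch builder
    return efficientdet.dataloaders.build_train_batch(b)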

When running on PyTorch Lightning, I tried the DDP accelerator, trainer = pl.Trainer(accelerator='ddp', *args), but then I encountered a memory leak that occurs during the validation step (described here: ), and I wasn't able to resolve it either.

potipot commented 3 years ago

I can confirm that multi-GPU training works with PyTorch Lightning using the DDP accelerator:

trainer = pl.Trainer(max_epochs=10, gpus=[0,1], accelerator='ddp')
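For reuse in a training script, the same settings can be collected in one place. The sketch below only builds the keyword arguments shown in this thread; `ddp_trainer_kwargs` is a hypothetical helper, and the commented `fit` call assumes the `light_model`, `train_dl` and `valid_dl` names from the quickstart. The argument names are the ones used here and may differ in other Lightning releases, so check the installed version.

```python
def ddp_trainer_kwargs(gpu_ids, max_epochs):
    """Build the pl.Trainer keyword arguments used in this thread."""
    if not gpu_ids:
        raise ValueError("need at least one GPU id")
    return {
        "max_epochs": max_epochs,
        "gpus": list(gpu_ids),   # explicit device ids, e.g. [0, 1]
        "accelerator": "ddp",    # one process per GPU
    }


kwargs = ddp_trainer_kwargs([0, 1], max_epochs=10)

# In a training script (not a notebook), with pytorch_lightning installed:
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**kwargs)
#   trainer.fit(light_model, train_dl, valid_dl)  # assumed names
```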

The memory leak was coming from COCOMetric accumulation. Will update with fastai distributed training.
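The thread does not say how the leak was fixed, so the following is only a generic sketch of how an accumulating metric can grow without bound and how clearing its state at the end of each epoch avoids that. `AccumulatingMetric` is a hypothetical class, not icevision's COCOMetric.

```python
class AccumulatingMetric:
    """Hypothetical metric that collects per-batch predictions for an epoch-level score."""

    def __init__(self):
        self._records = []

    def accumulate(self, preds):
        # Called once per validation batch.
        self._records.extend(preds)

    def finalize(self):
        # Called once per epoch: compute the result, then drop the stored records.
        result = {"num_predictions": len(self._records)}
        self._records.clear()  # without this, memory use grows every epoch
        return result


metric = AccumulatingMetric()
results = []
for epoch in range(2):
    for batch in range(3):
        metric.accumulate([f"pred-{epoch}-{batch}-{i}" for i in range(4)])
    results.append(metric.finalize())
```

Each epoch reports the same number of predictions, and nothing is carried over between epochs.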

lgvaz commented 3 years ago

The memory leak was coming from COCOMetric accumulation.

This is such excellent news :heart:

potipot commented 3 years ago

@lgvaz could you assign this one to me? I should remember to make some tutorials on how to do it :D

lgvaz commented 3 years ago

Here ya go @potipot !!! Thanks for the initiative!

deepwilson commented 2 years ago

Is it possible to use a multi-GPU setup with fastai from within my Jupyter notebook itself?

potipot commented 2 years ago

It is possible, but I wouldn't recommend it. AFAIR the only supported multi-GPU paradigm that works in a Jupyter notebook is DataParallel (DP in PyTorch Lightning). This is usually suboptimal and inferior to other paradigms such as DDP or DDP2. These, however, require running from scripts.
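One way to picture the difference: DDP typically runs one process per GPU, and each process works on its own slice of the dataset, whereas DP keeps a single process and splits every batch. The sketch below shows strided sharding by rank in plain Python; `shard_indices` is a hypothetical helper for illustration and does not handle the padding that real distributed samplers apply to uneven datasets.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices one process would see under strided sharding."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(num_samples))[rank::world_size]


# Two processes (one per GPU) over a dataset of 8 samples.
shards = [shard_indices(8, rank, world_size=2) for rank in range(2)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Because every rank is a separate Python process, this kind of setup is usually launched from a script rather than from a single notebook kernel.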