AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

DDP multi-GPU training with PyTorch Lightning #2475

Open JohnnyRacer opened 2 years ago

JohnnyRacer commented 2 years ago

Is there a way to add Dreambooth / TI / Hypernetwork training with PyTorch Lightning's Trainer class using the DDP strategy, as featured in @XavierXiao's repo? It allows for a very pain-free experience when using multi-GPU training compared to native PyTorch's solutions. Correct me if I'm wrong, but from what I've gathered there isn't a clean way to do this type of training with the code that's available now. If anyone has more information about how to do proper multi-GPU training, please feel free to chime in.

Code Here by XavierXiao

Desm0nt commented 2 years ago

It allows for a very pain-free experience when using multi-GPU training compared to native PyTorch's solutions.

Unfortunately, it brings a huge amount of pain if you try to train with DDP on a machine with multiple NVIDIA GPUs and an AMD Ryzen/Epyc CPU =( On a local machine you can disable IOMMU in the BIOS to bypass the issue, but on a rented cloud VM that's impossible =(
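One commonly suggested mitigation sketch (an assumption on my part, not something Desm0nt states): the hang on AMD platforms is usually attributed to IOMMU interfering with NCCL's GPU peer-to-peer transfers, so besides the BIOS toggle you can either pass a kernel parameter locally or disable NCCL's P2P path on a VM:

```shell
# Hedged workaround sketch for the AMD IOMMU + multi-GPU DDP hang:
# 1) Local machine: disable IOMMU in the BIOS, or add a kernel parameter
#    (e.g. "amd_iommu=off" in GRUB_CMDLINE_LINUX_DEFAULT, then update-grub).
# 2) Rented cloud VM (no BIOS access): disable NCCL's peer-to-peer
#    transport so communication falls back to a path that avoids the
#    IOMMU, at some inter-GPU bandwidth cost.
export NCCL_P2P_DISABLE=1
```

Whether the P2P fallback is enough depends on the board and the hypervisor, so treat this as a starting point rather than a guaranteed fix.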

JohnnyRacer commented 2 years ago

@Desm0nt I've not heard about this before; any idea why it happens? I've trained Dreambooth with DDP using the repo linked above on multiple Ampere-based GPUs in one machine and have not experienced this problem with Intel-based CPUs.

mzawdx2022 commented 1 year ago

@Desm0nt I've not heard about this before; any idea why it happens? I've trained Dreambooth with DDP using the repo linked above on multiple Ampere-based GPUs in one machine and have not experienced this problem with Intel-based CPUs.

How did you train Dreambooth using DDP? Could you give some steps? Thanks.