PDillis / stylegan3-fun

Modifications of the official PyTorch implementation of StyleGAN3. Let's easily generate images and videos with StyleGAN2/2-ADA/3!

Training stalls when using multiple GPUs #33

Open nuclearsugar opened 1 year ago

nuclearsugar commented 1 year ago

I have been struggling to use 2 GPUs when training. After I execute the command below, everything loads as usual, and then it stalls on reaching the training step. But when I run the same command with --gpus=1, it runs perfectly.

python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
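For readers unsure how --batch, --gpus and --batch-gpu relate in the command above, here is a minimal sketch of the arithmetic. It assumes the usual StyleGAN3 convention that the total batch is split evenly across GPUs and that each GPU then processes its share in chunks of --batch-gpu, accumulating gradients. The function name accumulation_rounds is hypothetical and not part of train.py.

```python
def accumulation_rounds(batch: int, gpus: int, batch_gpu: int) -> int:
    """Number of forward/backward passes each GPU runs per optimizer step.

    Assumption: the total batch is divided evenly across GPUs, and each GPU
    works through its share in chunks of `batch_gpu` samples.
    """
    if batch % (gpus * batch_gpu) != 0:
        raise ValueError("batch must be a multiple of gpus * batch_gpu")
    return batch // (gpus * batch_gpu)


if __name__ == "__main__":
    # Values taken from the command in this issue: --batch=32 --batch-gpu=8
    for gpus in (1, 2):
        rounds = accumulation_rounds(batch=32, gpus=gpus, batch_gpu=8)
        print(f"--gpus={gpus}: {rounds} accumulation round(s) per step")
```

Under that assumption, both the 1-GPU and 2-GPU runs use a valid combination of values, so the batch settings themselves are unlikely to be the cause of the stall.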

I'm not running out of VRAM (2x Quadro RTX 5000, 16 GB each) or RAM (32 GB). Here is a screenshot where you can see both GPUs at 0% load for an extended time: [screenshot: 2023-04-04 16_04_10-Greenshot]

I believe both GPUs are set up correctly and that StyleGAN2 should be able to use them both. Here is a screenshot of the nvidia-smi output: [screenshot: 2023-04-04 16_07_56-Window]

I did some googling to see if anyone else has had a similar issue. Interestingly, this recent issue over on the original repository seems to describe my problem precisely. Yet when I tried the suggested fix, I still experienced the same problem: training stalls upon reaching the training step.

Am I missing some detail or is this a bug? Thanks!

nuclearsugar commented 1 year ago

I looked through the history of issues and here are 3 others with the same bug:

nuclearsugar commented 1 year ago

In prior tests I was relying on CUDA 11.1.

Since environment.yml lists CUDA 11.3, I thought it would be worth testing with the required CUDA version. It took some tinkering, but I was able to get CUDA 11.3 working with the latest version of this repo. I'm still seeing the same stalling behavior: training stalls with --gpus=2, but runs smoothly with --gpus=1.

nuclearsugar commented 1 year ago

I ran a few more tests where I set the CUDA_VISIBLE_DEVICES environment variable so that StyleGAN training would only execute on a specific GPU. I can confirm that both of my GPUs are set up correctly for use in Python.

Training runs smoothly on GPU0:

set CUDA_VISIBLE_DEVICES=0
python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training runs smoothly on GPU1:

set CUDA_VISIBLE_DEVICES=1
python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training stalls as described previously:

set CUDA_VISIBLE_DEVICES=0,1
python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
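For anyone who wants to repeat these three tests without retyping the commands, here is a rough sketch that builds the environment and argument list for each case. It only mirrors the commands quoted in this thread. The helper name build_run is hypothetical, and the sketch assumes that --gpus should equal the number of visible devices.

```python
import os

# Arguments shared by all three test runs, copied from the commands above.
BASE_ARGS = [
    "python", "train.py",
    "--outdir=results", "--cfg=stylegan2", "--metrics=None",
    "--data=escher-512.zip", "--kimg=5000", "--gamma=10",
    "--batch=32", "--batch-gpu=8",
    "--resume=stylegan2-ffhq-512x512.pkl",
]


def build_run(visible_devices):
    """Return (env, argv) for one test run restricted to the given GPU indices."""
    devices = [str(d) for d in visible_devices]
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(devices)
    # Assumption: one training process per visible GPU.
    argv = BASE_ARGS + [f"--gpus={len(devices)}"]
    return env, argv


if __name__ == "__main__":
    for case in ([0], [1], [0, 1]):
        env, argv = build_run(case)
        print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

Each (env, argv) pair can then be passed to subprocess.run(argv, env=env) from the repository root to launch the corresponding run.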

nuclearsugar commented 1 year ago

I was finally able to get training to run successfully on 2 GPUs after following the directions in issue 218. It's a bit of a hack, but it works. FYI, I'm running Windows 10.

Would it be possible to implement a more permanent fix for this bug?

PDillis commented 1 year ago

That is indeed a bit of a hack. I haven't encountered errors when training with multiple GPUs (RTX 6000 and A40s), so perhaps there's something else I'm missing. I'll try to figure it out, but if you can share more on your environment and such, that'd be helpful to narrow it down.

nuclearsugar commented 1 year ago

I saw a comment from a contributor on the StyleGAN3 codebase mentioning that they don't typically run multi-GPU setups on Windows, presumably using Linux instead. So I'm not sure how heavily it has been tested on Windows. The other issues linked above also mention using Windows, so that seems telling.

Below is some info about my environment setup and hardware. Let me know if you need any other details.

Software Environment

Hardware