NVlabs / stylegan2-ada-pytorch

StyleGAN2-ADA - Official PyTorch implementation
https://arxiv.org/abs/2006.06676

train.py fails when gpus=2 (or something other than gpus=1) #139

Closed metaphorz closed 3 years ago

metaphorz commented 3 years ago

- OS: CentOS 7
- Python: 3.7.6
- PyTorch: 1.7.1+cu110
- GPU: 2x V100
- Docker: no, have not gone that route yet
- Related posted issues: none that I could find based solely on GPU count

I am running the GitHub repo for stylegan2-ada-pytorch. With the help of others on PyTorch versions, I was able to train successfully with gpus=1. So gpus=1 is working.

The system I am on has 2 V100s. When I set gpus=2 on `python train.py ....` I receive the following errors (traceback truncated and file references anonymized):

```
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Traceback (most recent call last):   [truncated]
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):   [truncated]
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "…/notebooks/stylegan2-ada-pytorch/train.py", line 422, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "…/notebooks/stylegan2-ada-pytorch/training/training_loop.py", line 290, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "…/notebooks/stylegan2-ada-pytorch/training/loss.py", line 134, in accumulate_gradients
    training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0
```

woctezuma commented 3 years ago

#91, #98, just in case it helps, even though I know you already had a look. You are correct that they are only tangentially related, because the numbers do not match (4 and 2 here vs. 512 and 256 in my links).

It looks like the error happens here, even though the line numbers do not match:

https://github.com/NVlabs/stylegan2-ada-pytorch/blob/d4b2afe9c27e3c305b721bc886d2cb5229458eba/training/loss.py#L116-L119

There is a sum of two terms, the first of which is: https://github.com/NVlabs/stylegan2-ada-pytorch/blob/d4b2afe9c27e3c305b721bc886d2cb5229458eba/training/loss.py#L94-L104
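The mismatch itself is an ordinary elementwise-add shape error: the two loss terms end up with different batch dimensions (4 vs. 2), so their sum cannot be broadcast. A minimal analogy (in NumPy, not the repo's actual tensors) fails the same way:

```python
import numpy as np

# Stand-ins for the two per-sample loss terms summed in loss.py.
# With gpus=2, the batch dimensions of the two terms disagree (4 vs. 2).
loss_Dgen = np.zeros(4)   # e.g. generator-side D loss, batch of 4
loss_Dreal = np.zeros(2)  # e.g. real-side D loss, batch of 2

try:
    total = loss_Dgen + loss_Dreal  # mirrors loss_Dgen + loss_Dreal
except ValueError as err:
    # NumPy raises ValueError where PyTorch raises RuntimeError,
    # but the cause is the same: incompatible leading dimensions.
    print("shape mismatch:", err)
```

This suggests the two code paths above are being fed batches of different effective sizes when the work is split across GPUs.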

metaphorz commented 3 years ago

Also, found this and so asked the poster: https://github.com/lucidrains/stylegan2-pytorch/issues/209

metaphorz commented 3 years ago

Tried an experiment and got part-way there. There is a `cfg` config option in train.py. The config had been set to `11gb-gpu`, and that worked fine as long as gpus=1, but not >1. So I tried setting it to `auto`, and while that worked with multiple GPUs, the fake images generated were bizarre (mostly red or green, nothing like the starting network (wikiart.pkl) or the images used in training). So now I am retracing steps, wondering whether there is a config that will generate accurate fake images on multiple GPUs. To see all config options, look in train.py for the variable `cfg_specs`. If I find something, I'll report back.
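One plausible source of such mismatches is a global batch size that does not divide evenly across GPUs, so each rank ends up with a different share. A hedged sketch (the function name `pick_batch` is illustrative, not from train.py) of how an auto-style config might keep the per-GPU batch consistent:

```python
# Illustrative sketch, not the repo's actual cfg_specs logic: round the
# global batch down to a multiple of num_gpus so every replica gets an
# equal share. Unequal shares across ranks are a classic cause of
# tensor-size mismatches like the 4-vs-2 error above.
def pick_batch(num_gpus, target_batch=32):
    batch = max(num_gpus, (target_batch // num_gpus) * num_gpus)
    return batch, batch // num_gpus  # (global batch, per-GPU batch)

print(pick_batch(2))                   # -> (32, 16)
print(pick_batch(3, target_batch=32))  # -> (30, 10)
```

If a fork's custom config hard-codes batch or minibatch-std sizes for a single GPU, splitting across two GPUs could produce exactly this kind of inconsistency.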

metaphorz commented 3 years ago

`--cfg=stylegan2` works for me in a trial with one node and two GPUs.
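A command along these lines matches that setup (dataset, output, and resume paths below are placeholders, not the poster's actual paths):

```shell
# Placeholder paths; --gpus and --cfg are train.py's documented options.
python train.py --outdir=./training-runs --data=./datasets/mydataset.zip \
    --gpus=2 --cfg=stylegan2 --resume=./pretrained/wikiart.pkl
```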

woctezuma commented 3 years ago

You are using a fork, because the config (11gb-gpu) which you mentioned is not part of this repository.

https://github.com/NVlabs/stylegan2-ada-pytorch/blob/d4b2afe9c27e3c305b721bc886d2cb5229458eba/train.py#L154-L163

MoemaMike commented 3 years ago

Is this issue applicable to the Colab Pro environment? I was under the impression that, while as a Colab Pro user I have access to 4 GPUs, I can only access one GPU per Colab notebook. If I am wrong and I can run gpus=2 or more, that would be welcome news.

metaphorz commented 3 years ago

You are right, I am using a fork (from dvschultz); however, look at the function `setup_training_loop_kwargs`, where `cfg` is defined as an option:

https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/train.py

It is under `# Base config.`

PS: I just realized that you were right on the 11gb-gpu. Not sure where that came from.


metaphorz commented 3 years ago

I also use Colab Pro. As a Colab Pro user, to my knowledge, you have access to a node that contains only one GPU. Typically this will be a P100, but if you are lucky you get a V100.

So, for multiple GPUs, you need to go the server route, which is admittedly a bit painful compared with Colab. I think Paperspace and vast.ai support multiple GPUs. So my workflow consists of starting on Colab, creating or modifying a notebook, and then moving it to a server to get multiple GPUs.


woctezuma commented 3 years ago

PS: I just realized that you were right on the 11gb-gpu. Not sure where that came from.

It is part of the fork. I know this fork, even though I don't use it. :)