Closed metaphorz closed 3 years ago
It looks like the error happens here, even though the line numbers do not match. The failing expression is a sum of two terms, the first of which is computed here: https://github.com/NVlabs/stylegan2-ada-pytorch/blob/d4b2afe9c27e3c305b721bc886d2cb5229458eba/training/loss.py#L94-L104
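The error message at the bottom of the traceback can be reproduced in isolation: it is an elementwise add of two per-sample loss tensors whose batch dimensions disagree. A minimal illustrative repro (not the repo's code; the tensor names just mirror the ones in loss.py):

```python
import torch

# Illustrative: two per-sample loss tensors sized by different batch splits.
loss_Dgen = torch.zeros(4)   # e.g. sized by one phase's per-GPU batch
loss_Dreal = torch.zeros(2)  # sized by a different split of the batch
try:
    _ = loss_Dgen + loss_Dreal  # elementwise add requires matching dim 0
except RuntimeError as err:
    msg = str(err)
    print(msg)
```

This prints the same "size of tensor a (4) must match the size of tensor b (2)" complaint seen in the traceback, which suggests the two discriminator loss terms end up with different per-GPU batch sizes under the failing config.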
Also, I found this related issue and asked the poster there: https://github.com/lucidrains/stylegan2-pytorch/issues/209
Tried an experiment and got part-way there. There is a "cfg" config option in train.py. The config had been set to 11gb-gpu, which worked fine as long as gpus=1, but not with more. So I tried setting it to auto, and while that worked with multiple GPUs, the fake images generated were bizarre: mostly red or green, nothing like the starting network (wikiart.pkl) or the images used in training. So now I am retracing steps, wondering whether there is a config that will generate accurate fake images on multiple GPUs. To see all config options, look in train.py for the variable cfg_specs. If I find something, I'll report back.
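For reference, cfg_specs in train.py is a dict mapping each --cfg name to a bundle of training hyperparameters. The sketch below is abridged and from memory, so the exact fields and values may differ from the repo:

```python
# Abridged sketch of train.py's cfg_specs (field names/values from memory;
# check the repo for the authoritative version). Each --cfg name selects a
# bundle of training hyperparameters.
cfg_specs = {
    # 'auto' derives batch size etc. from the detected GPU count/resolution
    'auto':      dict(ref_gpus=-1, kimg=25000, mb=-1, mbstd=-1),
    # 'stylegan2' reproduces the original StyleGAN2 setup (8-GPU reference)
    'stylegan2': dict(ref_gpus=8,  kimg=25000, mb=32, mbstd=4),
}

for name, spec in sorted(cfg_specs.items()):
    print(name, spec)
```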
--cfg='stylegan2' works for me on a trial with one node and two GPUs.
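For concreteness, a multi-GPU launch along those lines might look like the following; all paths are placeholders, and the flag spellings should be checked against train.py's --help:

```shell
# Placeholder paths; verify flags with `python train.py --help`.
python train.py \
    --outdir=./results \
    --data=./dataset.zip \
    --gpus=2 \
    --cfg=stylegan2 \
    --resume=./wikiart.pkl
```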
You are using a fork, because the config (11gb-gpu) which you mentioned is not part of this repository.
Is this issue applicable to the Colab Pro environment? I was under the impression that, while as a Colab Pro user I have access to 4 GPUs, I can only access one GPU per Colab notebook. If I am wrong and I can run gpus=2 or more, that would be welcome news.
You are right. I am using a fork (from dvschultz); however, look at the function setup_training_loop_kwargs, where cfg is defined as an option:
https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/train.py
It is under # Base config.
-p
PS: I just realized that you were right on the 11gb-gpu. Not sure where that came from.
Paul Fishwick, PhD Distinguished University Chair of Arts, Technology, and Emerging Communication Professor of Computer Science Director, Creative Automata Laboratory The University of Texas at Dallas Arts & Technology 800 West Campbell Road, AT10 Richardson, TX 75080-3021 Home: utdallas.edu/atec/fishwick Media: @.*** Modeling: digest.sigsim.org Twitter: @PaulFishwick
ONLINE: Webex,Collaborate, TEAMS, Zoom, Skype, Hangout
From: Wok. Date: Friday, July 9, 2021 at 1:15 AM. Subject: Re: [NVlabs/stylegan2-ada-pytorch] train.py fails when gpus=2 (or something other than gpus=1) (#139)
I also use Colab Pro. As a Colab Pro user, to my knowledge, you have access to a node that contains only one GPU. Typically this will be a P100, but if you are lucky you get a V100. So, for multiple GPUs, you need to go the server route, which is admittedly a bit painful compared with Colab. I think Paperspace and vast.ai support multiple GPUs. So my workflow consists of starting on Colab, creating or modifying a notebook, and then moving it to a server to get multiple GPUs.
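A quick way to confirm how many GPUs the current runtime actually exposes (on a Colab Pro notebook this typically reports 1, and 0 when no CUDA device is attached):

```python
import torch

# Number of CUDA devices visible to this process.
n_gpus = torch.cuda.device_count()
print(n_gpus)
```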
-p
It is part of the fork. I know this fork, even though I don't use it. :)
OS: CentOS 7
Python: 3.7.6
PyTorch: 1.7.1+cu110
GPU: 2× V100
Docker: No, have not gone that route yet
Related posted issues: none that I could find based solely on GPU count
I am running the GitHub repo for stylegan2-ada-pytorch. With the help of others on PyTorch versions, I was able to train successfully with gpus=1, so gpus=1 is working.
The system I am on has 2 V100s. When I set gpus=2 on python train.py, I receive the following errors (traceback truncated and file references anonymized):
```
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...

Truncated Traceback (most recent call last):
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Truncated Traceback (most recent call last):
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "…/notebooks/stylegan2-ada-pytorch/train.py", line 422, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "…/notebooks/stylegan2-ada-pytorch/training/training_loop.py", line 290, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "…/notebooks/stylegan2-ada-pytorch/training/loss.py", line 134, in accumulate_gradients
    training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0
```
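For what it's worth, the shape mismatch itself disappears if each loss term is reduced to a scalar before the add. This is only an illustration of the mechanics, not the repo's actual fix; the underlying cause appears to be how the per-GPU batch is split under the chosen config:

```python
import torch

# Illustrative only: tensors with mismatched batch dims (4 vs 2) cannot be
# added elementwise, but reducing each to a scalar mean first can.
loss_Dgen = torch.full((4,), 0.5)
loss_Dreal = torch.full((2,), 0.25)
total = loss_Dgen.mean() + loss_Dreal.mean()
print(total.item())  # 0.75
```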