genforce / insgen

[NeurIPS 2021] Data-Efficient Instance Generation from Instance Discrimination
https://genforce.github.io/insgen/

How to train insgen with only 1 GPU #5

Open Johnson-yue opened 2 years ago

Johnson-yue commented 2 years ago

Thanks for sharing. I only have 1 GPU, so I cannot train the model.

I see that multiple GPUs are needed because of the ‘effect of disabling shuffle BN to MoCo’.

But I cannot understand why the batch data must be shuffled across all GPUs rather than within a single GPU.

Could you provide a way to shuffle the batch data on 1 GPU, even if it does not have the same effect?

49xxy commented 2 years ago

I ran this on 2 GPUs and it was much slower than the baseline StyleGAN2, taking nearly twice as long. I then followed the solution in issue #1 and ran it on Colab. Again, it took twice as long.

GilesBathgate commented 1 year ago

If I apply this change I can run on a single GPU:

--- a/train.py
+++ b/train.py
@@ -413,7 +413,7 @@ def subprocess_fn(rank, args, temp_dir):
     dnnlib.util.Logger(file_name=os.path.join(args.run_dir, 'log.txt'), file_mode='a', should_flush=True)

     # Init torch.distributed.
-    if args.num_gpus > 1:
+    if args.num_gpus > 0:

The key to the above patch is that, even with 1 GPU, the following code must run so that the process group is initialised via the torch.distributed.init_process_group function: https://github.com/genforce/insgen/blob/52bda7cfe59094fbb2f533a0355fff1392b0d380/train.py#L406-L430

I changed the > condition for brevity.
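
As a rough illustration of the point above, the sketch below sets up a process group for a single process (rank 0, world size 1). It is not the code from train.py: the helper names make_init_method and init_single_process_group are hypothetical, and the gloo backend is an assumption chosen only so the sketch can also run on a machine without NCCL. The linked train.py lines remain the reference for what the repository actually does.

```python
import os


def make_init_method(temp_dir):
    """Build a file:// rendezvous URL inside temp_dir (hypothetical helper)."""
    init_file = os.path.abspath(os.path.join(temp_dir, '.torch_distributed_init'))
    if os.name == 'nt':
        # Windows paths need forward slashes and an extra leading slash.
        return 'file:///' + init_file.replace('\\', '/')
    return f'file://{init_file}'


def init_single_process_group(temp_dir, backend='gloo'):
    """Initialise torch.distributed with rank 0 and world size 1 (hypothetical helper)."""
    # Imported here so make_init_method can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,                         # assumption: 'gloo' for portability
        init_method=make_init_method(temp_dir),  # file-based rendezvous
        rank=0,
        world_size=1,
    )
    return dist.get_world_size()
```

Calling init_single_process_group(temp_dir) once at start-up, with temp_dir pointing at an existing writable directory, should leave a default process group in place for any later torch.distributed calls.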

O-O1024 commented 1 year ago

This method does not work on my machine. It produced many different strange errors, and I don't know why.

GilesBathgate commented 1 year ago

@jkla139 I actually used the copy from https://github.com/Zhendong-Wang/Diffusion-GAN . I cloned from the main branch of this repo and got:

:$ python train.py --gpus=1 ...
...
  File "/home/giles/projects/insgen/training/training_loop.py", line 407, in training_loop
    module = module.module
  File "/home/giles/projects/stylegan2-ada-pytorch/.env/stylegan2-ada-pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 778, in __getattr__
    raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'CLHead' object has no attribute 'module'

I tracked this down to the following:

--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -403,8 +403,6 @@ def training_loop(
             snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs))
             for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead)]:
                 if module is not None:
-                    if name in ['DHead', 'GHead']:
-                        module = module.module
                     if num_gpus > 1:
                         misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg')
                     module = copy.deepcopy(module).eval().requires_grad_(False).cpu()

diffusion-gan doesn't seem to have this, so I just removed those lines. It seems to be working.
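
An alternative to deleting the two lines is to unwrap conditionally. The sketch below assumes that DHead and GHead are wrapped in a DistributedDataParallel-style wrapper only when training on several GPUs, which the traceback suggests but which I have not confirmed. unwrap_if_wrapped is a hypothetical helper, not part of the repository. It matches on the class name only to keep the sketch free of a torch import; in real code an isinstance check against torch.nn.parallel.DistributedDataParallel would be the more usual test.

```python
# Class names treated as parallel wrappers (assumption for this sketch).
WRAPPER_CLASS_NAMES = {'DistributedDataParallel', 'DataParallel'}


def unwrap_if_wrapped(module):
    """Return module.module for a parallel wrapper, otherwise module unchanged."""
    if type(module).__name__ in WRAPPER_CLASS_NAMES:
        # Wrapped (multi-GPU) case: the real network lives in .module.
        return module.module
    # Unwrapped (single-GPU) case: nothing to strip.
    return module
```

With a helper like this, the snapshot loop could call module = unwrap_if_wrapped(module) for every entry, so the same code path would serve both single-GPU and multi-GPU runs.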

O-O1024 commented 1 year ago

Yes, these two lines need to be deleted. Now it works.