genforce / insgen

[NeurIPS 2021] Data-Efficient Instance Generation from Instance Discrimination
https://genforce.github.io/insgen/

Training does not work with 1 GPU #2

Open kata44 opened 3 years ago

kata44 commented 3 years ago

There seems to be a problem with the contrastive loss when training on 1 GPU; training only works when setting no_insgen=true.

The output is:

Setting up augmentation...
Distributing across 1 GPUs...
Distributing Contrastive Heads across 1 GPUS...
Setting up training phases...
Setting up contrastive training phases...
Exporting sample images...
Initializing logs...
2021-09-18 04:23:26.767334: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Training for 25000 kimg...

Traceback (most recent call last):
  File "train.py", line 583, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 576, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 421, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/home/katarina/ML/insgen/training/training_loop.py", line 326, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain, cl_phases=cl_phases, D_ema=D_ema, g_fake_cl=not no_cl_on_g, **cl_loss_weight)
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 156, in accumulate_gradients
    loss_Dreal = loss_Dreal + lw_real_cl * self.run_cl(real_img_tmp, real_c, sync, Dphase.module, D_ema, loss_name='D_cl')
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 71, in run_cl
    loss = contrastive_head(logits0, logits1, loss_only=loss_only, update_q=update_q)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 183, in forward
    self._dequeue_and_enqueue(k)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 51, in _dequeue_and_enqueue
    keys = concat_all_gather(keys)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 197, in concat_all_gather
    for _ in range(torch.distributed.get_world_size())]
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
    return _get_group_size(group)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
    default_pg = _get_default_group()
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
kata44 commented 3 years ago

I attempted this fix:

diff --git a/training/contrastive_head.py b/training/contrastive_head.py
index e09367e..4517bac 100644
--- a/training/contrastive_head.py
+++ b/training/contrastive_head.py
@@ -189,10 +189,15 @@ class CLHead(torch.nn.Module):

 @torch.no_grad()
 def concat_all_gather(tensor):
+
+    if not torch.distributed.is_initialized():
+        return tensor
+
     """
     Performs all_gather operation on the provided tensors.
     *** Warning ***: torch.distributed.all_gather has no gradient.
     """
+
     tensors_gather = [torch.ones_like(tensor)
         for _ in range(torch.distributed.get_world_size())]
     torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
diff --git a/training/training_loop.py b/training/training_loop.py
index a09c5a1..efbef17 100755
--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -398,9 +398,11 @@ def training_loop(
             snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs))
             for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead)]:
                 if module is not None:
-                    if name in ['DHead', 'GHead']:
-                        module = module.module
                     if num_gpus > 1:
+
+                        if name in ['DHead', 'GHead']:
+                            module = module.module
+
                         misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg')
                     module = copy.deepcopy(module).eval().requires_grad_(False).cpu()
                 snapshot_data[name] = module

However, this halves training throughput with InsGen enabled. According to the paper, "the extra computing load is extremely small and the training efficiency is barely affected", so I assume this is not doing the right thing.

Johnson-yue commented 2 years ago

@kata44 I also want to train with only 1 GPU, and I think modifying only the concat_all_gather() function is not enough.

As the code at _batch_shuffle_ddp() (https://github.com/genforce/insgen/blob/52bda7cfe59094fbb2f533a0355fff1392b0d380/training/contrastive_head.py#L73-L75) and in _batch_unshuffle_ddp() shows, those helpers also assume distributed training.
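For what it's worth, a minimal sketch of that idea (the helper below is hypothetical, and the call-site names follow the MoCo-style reference code linked above, so the exact insgen signatures may differ): skip the cross-GPU shuffle/unshuffle whenever no multi-process group is running.

import torch

def is_dist_ready():
    # True only when a default process group exists and spans more than one process.
    return (torch.distributed.is_available()
            and torch.distributed.is_initialized()
            and torch.distributed.get_world_size() > 1)

# At the call sites in contrastive_head.py (names as in the MoCo reference):
#   if is_dist_ready():
#       x, idx_unshuffle = self._batch_shuffle_ddp(x)
#   ...
#   if is_dist_ready():
#       x = self._batch_unshuffle_ddp(x, idx_unshuffle)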

RuoyuGuo commented 2 years ago

Hi, I am not familiar with multi-GPU training, but I think the bug is triggered by def _dequeue_and_enqueue(...) in contrastive_head.py.

Now look at line 51, keys = concat_all_gather(keys). I guess this line only concatenates the distributed tensors from the different GPUs, which is unnecessary with 1 GPU, so I simply delete this line when training on 1 GPU.
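An alternative to deleting the line is to guard the call site, so the same file still works for multi-GPU runs; a sketch (not the repo's actual code):

# inside _dequeue_and_enqueue() in contrastive_head.py
if torch.distributed.is_available() and torch.distributed.is_initialized():
    keys = concat_all_gather(keys)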

49xxy commented 2 years ago

Hi, I am not familiar with multi-GPU training, but I think the bug is triggered by def _dequeue_and_enqueue(...) in contrastive_head.py.

Now look at line 51; I guess this line only concatenates the distributed tensors from the different GPUs, which is unnecessary with 1 GPU, so I simply delete the line keys = concat_all_gather(keys) when training on 1 GPU.

Have you solved this problem? Can you train with one GPU?

49xxy commented 2 years ago

Hi, I am not familiar with multi-GPU training, but I think the bug is triggered by def _dequeue_and_enqueue(...) in contrastive_head.py.

Now look at line 51; I guess this line only concatenates the distributed tensors from the different GPUs, which is unnecessary with 1 GPU, so I simply delete the line keys = concat_all_gather(keys) when training on 1 GPU.

Hi! Can I delete this line and still train normally?

GilesBathgate commented 1 year ago

I think the issue is simply that the process group needs to be initialised even if there is only one GPU; see the patch in #5.
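For reference, a minimal sketch of that approach (not the exact patch from #5; the helper name and the env:// rendezvous settings are illustrative): create a one-member default process group before the training loop, so every torch.distributed collective used by the contrastive heads has a valid group to operate on.

import os
import torch

def init_single_gpu_process_group():
    # A one-member group makes all_gather and friends valid (trivial) collectives.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    backend = 'nccl' if torch.cuda.is_available() and os.name != 'nt' else 'gloo'
    torch.distributed.init_process_group(backend=backend, rank=0, world_size=1)

With world_size == 1, concat_all_gather, _batch_shuffle_ddp and _batch_unshuffle_ddp should all reduce to trivial single-process operations, so no further changes to contrastive_head.py should be needed.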