kata44 opened this issue 3 years ago
I attempted this fix:
diff --git a/training/contrastive_head.py b/training/contrastive_head.py
index e09367e..4517bac 100644
--- a/training/contrastive_head.py
+++ b/training/contrastive_head.py
@@ -189,10 +189,15 @@ class CLHead(torch.nn.Module):
 @torch.no_grad()
 def concat_all_gather(tensor):
+
+    if not torch.distributed.is_initialized():
+        return tensor
+
     """
     Performs all_gather operation on the provided tensors.
     *** Warning ***: torch.distributed.all_gather has no gradient.
     """
+
     tensors_gather = [torch.ones_like(tensor)
         for _ in range(torch.distributed.get_world_size())]
     torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
diff --git a/training/training_loop.py b/training/training_loop.py
index a09c5a1..efbef17 100755
--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -398,9 +398,11 @@ def training_loop(
             snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs))
             for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead)]:
                 if module is not None:
-                    if name in ['DHead', 'GHead']:
-                        module = module.module
                     if num_gpus > 1:
+
+                        if name in ['DHead', 'GHead']:
+                            module = module.module
+
                         misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg')
                     module = copy.deepcopy(module).eval().requires_grad_(False).cpu()
                 snapshot_data[name] = module
However, this halves training throughput with InsGen enabled. According to the paper, "the extra computing load is extremely small and the training efficiency is barely affected", so I assume this is not doing the right thing.
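For reference, here is the guarded helper written out as a complete function. This is only a sketch of what the hunk above amounts to: the trailing torch.cat/return lines are assumed from the MoCo-style implementation the file follows, the guard is placed after the docstring, and torch.distributed.is_available() is checked as well so the same code runs on a CPU-only build:

```python
import torch

@torch.no_grad()
def concat_all_gather(tensor):
    """
    Performs all_gather operation on the provided tensors.
    *** Warning ***: torch.distributed.all_gather has no gradient.
    """
    # Single-process fallback: with no (initialised) process group there is
    # nothing to gather, so return the local tensor unchanged.
    if not torch.distributed.is_available() or not torch.distributed.is_initialized():
        return tensor

    # One buffer per rank, filled by the collective below.
    tensors_gather = [torch.ones_like(tensor)
                      for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)

    # Concatenate along the batch dimension.
    return torch.cat(tensors_gather, dim=0)
```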
@kata44 I also want to train with only 1 GPU, and I think modifying only the concat_all_gather() function is not enough, as the comments in _batch_shuffle_ddp() (https://github.com/genforce/insgen/blob/52bda7cfe59094fbb2f533a0355fff1392b0d380/training/contrastive_head.py#L73-L75) and _batch_unshuffle_ddp() point out.
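A single-GPU bypass for that pair could look like the sketch below. The _single_gpu names are hypothetical, and the (x, idx_unshuffle) signature is assumed from the MoCo-style code linked above; since ShuffleBN only exists to decorrelate BatchNorm statistics across GPUs, an identity permutation is enough in a single process:

```python
import torch

@torch.no_grad()
def _batch_shuffle_single_gpu(x):
    # Hypothetical stand-in for _batch_shuffle_ddp() when torch.distributed
    # is not initialised: keep the batch order and hand back an identity
    # "unshuffle" index.
    idx_unshuffle = torch.arange(x.shape[0], device=x.device)
    return x, idx_unshuffle

@torch.no_grad()
def _batch_unshuffle_single_gpu(x, idx_unshuffle):
    # Hypothetical stand-in for _batch_unshuffle_ddp(): with the identity
    # permutation above this is effectively a no-op.
    return x[idx_unshuffle]
```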
Hi, I am not familiar with multi-GPU training, but I think the bug is triggered by def _dequeue_and_enqueue(...) from contrastive_head.py. Now look at line 51, keys = concat_all_gather(keys): I guess this line only concatenates the distributed tensors from different GPUs, which is unnecessary on a single GPU. So I simply delete this line when training on 1 GPU.
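Instead of deleting the line, the call site could be guarded so the multi-GPU path stays intact; a rough sketch, with a hypothetical helper name:

```python
import torch
from training.contrastive_head import concat_all_gather

def gather_keys_if_distributed(keys):
    # Hypothetical helper for _dequeue_and_enqueue(): only gather keys from
    # other ranks when a process group actually exists.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return concat_all_gather(keys)  # multi-GPU: keys from every rank
    return keys                         # single GPU: nothing to gather
```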
Have you solved this problem? Can you train with one GPU?
Hi! Can I delete that keys = concat_all_gather(keys) line for normal training?
I think the issue is simply that the process group needs to be initialised even if there is only one GPU; see the patch in #5.
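The general idea (not the literal patch from #5) is to bring up a one-process group so the distributed collectives become valid no-ops; a sketch, where the helper name and port value are arbitrary and MASTER_ADDR/MASTER_PORT are the standard torch.distributed environment variables:

```python
import os
import torch
import torch.distributed as dist

def init_single_process_group(port=12355):
    # Bring up a world_size=1 process group so the all_gather / broadcast
    # calls in contrastive_head.py run instead of raising the usual
    # "Default process group has not been initialized" error.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', str(port))
    backend = 'nccl' if torch.cuda.is_available() else 'gloo'
    dist.init_process_group(backend=backend, rank=0, world_size=1)
```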
There seems to be a problem with the contrastive loss when training on 1 GPU; training only works when setting no_insgen=true.
The output is: