
[CVPR 2023] MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation
https://meta-portrait.github.io/
MIT License

training error #12

Open wzr0108 opened 1 year ago

wzr0108 commented 1 year ago

When I run the code with the following command:

python main.py --config config/meta_portrait_256_pretrain_warp.yaml --fp16 --stage Warp --task Pretrain

I get this error:

start to train...
Epoch 1 Iter 0 D/Time : 6.768/00h00m06s warp_perceptual : 124.12;loss_G_init : 0.00;loss_D_init : 0.00loss_G_last : 0.00;loss_D_last : 0.00
Epoch 1 Iter 0 Step 0 event save
Traceback (most recent call last):
  File "main.py", line 125, in <module>
    mp.spawn(main, nprocs=params.ngpus, args=(params,))
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/disk/sde/wzr/MetaPortrait/base_model/main.py", line 118, in main
    train_ddp(args, conf, models, datasets)
  File "/disk/sde/wzr/MetaPortrait/base_model/train_ddp.py", line 114, in train_ddp
    losses_G, generated = G_full(data, stage=args["stage"])
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 787, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 140 141 142 143
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

My environment is Python 3.8, torch 1.9.1+cu111.
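
As the error message itself suggests, TORCH_DISTRIBUTED_DEBUG can name the parameters that got no gradient instead of just printing their indices. A minimal sketch of how one could enable it, assuming it is set before the process group is created (e.g. exported in the shell, or placed at the top of main.py so the workers spawned by mp.spawn inherit it):

import os

# Must take effect before torch.distributed.init_process_group() / DDP
# construction. DETAIL reports the *names* of parameters that received no
# gradient on each rank, which makes indices like "140 141 142 143" traceable.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"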

wzr0108 commented 1 year ago

I modified the code in train_ddp.py as follows:

with torch.cuda.amp.autocast():
    losses_G, generated = G_full(data, stage=args["stage"])
    loss_G = sum([val.mean() for val in losses_G.values()])
    # avoid ddp bug: touch every generated output so all params receive a grad
    for k, v in generated.items():
        loss_G += v.mean() * 0.0
scaler.scale(loss_G).backward()

The error is gone after this change, but I am not sure whether the training result is still correct.
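
A standalone sketch of why the extra `* 0.0` terms should not change the result (a toy module, not the repo's actual G_full): every returned tensor is attached to the loss, so DDP's reducer sees a gradient for every parameter, but the added terms contribute exactly zero to the loss value and to the real gradients.

import torch
import torch.nn as nn

# Toy stand-in for the real setup: "aux" is an output the loss does not use,
# which is what makes DDP's reducer complain about unused parameters.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Linear(4, 4)
        self.aux = nn.Linear(4, 4)  # parameters that would receive no gradient

    def forward(self, x):
        return {"out": self.main(x), "aux": self.aux(x)}

model = Toy()
generated = model(torch.randn(2, 4))
loss = generated["out"].mean()

# The workaround: add every returned tensor times 0.0. Each tensor now takes
# part in the graph, so every parameter gets a gradient, but the extra terms
# are exactly zero.
for v in generated.values():
    loss = loss + v.mean() * 0.0
loss.backward()

print(model.aux.weight.grad.abs().sum())   # tensor(0.) -> zero, but not missing
print(model.main.weight.grad.abs().sum())  # unchanged, non-zero

The less invasive alternative would be to avoid returning (or building) outputs that the current stage never uses, but the zero-multiplier trick keeps the training numerically identical as far as I can tell.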

ForeverFancy commented 1 year ago

Thanks for pointing this out, I will check it later.

xueziii commented 1 year ago

I want to train a model from scratch, but how should I prepare the training data? I don't know how to obtain the ldmk, theta, id, and map_dict files or where to put them.