Open Rayiz3 opened 6 days ago
I didn't encounter this issue before. You mentioned that the problem happens in sync_params(); can you try removing that call as a quick workaround, since stage 2 only uses a single GPU for training? I will check the code when I have more time to see what caused the problem. Thank you!
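For example, one way to do that (just a sketch, assuming the sync is only needed for multi-GPU runs) is to skip the broadcast when only a single process is running:

```python
import torch as th
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    Skips the broadcast entirely when running on a single process.
    """
    # Nothing to synchronize if torch.distributed is not initialized
    # or only one rank is running (e.g. mpiexec -n 1).
    if not dist.is_available() or not dist.is_initialized() or dist.get_world_size() == 1:
        return
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```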
I was actually working on Windows; when I run it on WSL instead, it works well. Thank you!
Hi, thank you for providing the open-source code.
While running stage 2 training with the following command:

```
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
```

the code gave me an error: `RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.`
The message said that I should either replace the in-place operations with out-of-place ones or use torch.no_grad(). But sync_params(), which is where the error actually occurs, already wraps the broadcast in torch.no_grad():
```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```
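For reference, this is roughly what I understood the error message to be suggesting (just a sketch on my side, not tested): broadcasting the parameter's .data so the leaf tensor itself is not written to in place.

```python
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        # Broadcasting p.data keeps autograd from seeing an in-place
        # write to a leaf tensor that requires grad.
        dist.broadcast(p.data, 0)
```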
Can you give me some advice on how to handle this problem?
Thank you.