Open Rayiz3 opened 6 days ago
I didn't encounter this issue before. You mentioned that the problem happens in sync_params(); can you try removing that call as a quick workaround, since stage 2 only uses a single GPU for training? I will check the code when I have more time to see what caused the problem. Thank you!
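For example, one way to do that (just a sketch, assuming the sync is only needed for multi-GPU runs) is to skip the broadcast when only a single process is running:

```python
import torch as th
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    Skips the broadcast entirely when running on a single process.
    """
    # Nothing to synchronize if torch.distributed is not initialized
    # or only one rank is running (e.g. mpiexec -n 1).
    if not dist.is_available() or not dist.is_initialized() or dist.get_world_size() == 1:
        return
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```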
I was actually working on Windows; when I run it on WSL instead, it works well. Thank you!
Hi, thank you for providing the open-source code.
While running stage 2 training with the following command:

```
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
```

the code gave me an error: `RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.`
The message said that I should either replace the in-place operations with out-of-place ones or use torch.no_grad(). But sync_params(), which is where the error actually occurs, already wraps the broadcast in torch.no_grad():
```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```
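For reference, this is roughly what I understood the error message to be suggesting (just a sketch on my side, not tested): broadcasting the parameter's .data so the leaf tensor itself is not written to in place.

```python
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        # Broadcasting p.data keeps autograd from seeing an in-place
        # write to a leaf tensor that requires grad.
        dist.broadcast(p.data, 0)
```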
Can you give me some advice on how to handle this problem?
Thank you.