VITA-Group / GNT

[ICLR 2023] "Is Attention All NeRF Needs?" by Mukund Varma T*, Peihao Wang* , Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang
https://vita-group.github.io/GNT
MIT License

Pre-trained optimizer mismatch with the model when resuming training #7

Closed zichen34 closed 1 year ago

zichen34 commented 1 year ago

Dear authors, thank you for your great work.

I would like to resume training from your provided pre-trained models, so I ran the following command:

python train.py --config configs/gnt_llff.txt --ckpt_path=./trex_model_300000.pth --train_scenes trex --eval_scenes trex --expname resume_trex --chunk_size 500 --N_samples 20

This raises RuntimeError: The size of tensor a (64) must match the size of tensor b (4) at non-singleton dimension 1 when executing this line: https://github.com/VITA-Group/GNT/blob/33a99a9cfb110c6d5de124684f4aa6ab930ea4ae/train.py#L144

The following is the traceback:

outputs will be saved to ./out/resume_trex
training dataset: llff_test
loading ['trex'] for train
loading ['trex'] for validation
Reloading from ./trex_model_300000.pth, starting at step=300000
Traceback (most recent call last):
  File "/home/rt/Downloads/GNT/train.py", line 319, in <module>
    train(args)
  File "/home/rt/Downloads/GNT/train.py", line 144, in train
    model.optimizer.step()
  File "/home/rt/anaconda3/envs/GNT/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/rt/anaconda3/envs/GNT/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/rt/anaconda3/envs/GNT/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/rt/anaconda3/envs/GNT/lib/python3.9/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/home/rt/anaconda3/envs/GNT/lib/python3.9/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/home/rt/anaconda3/envs/GNT/lib/python3.9/site-packages/torch/optim/adam.py", line 262, in _single_tensor_adam
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: The size of tensor a (64) must match the size of tensor b (4) at non-singleton dimension 1

By checking the pre-trained optimizer, I found that the ['optimizer']['state'] entries for some layers have shapes that do not match their corresponding layers. For example, the optimizer state belonging to the 15th layer is a tensor of shape torch.Size([64, 64]), but the 15th layer of the GNT model has shape torch.Size([8, 4]).
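
To make the comparison concrete, here is a minimal inspection sketch (not part of the GNT codebase). It assumes only the ['optimizer']['state'] layout quoted above plus an already-built nn.Module, and compares each Adam exp_avg buffer against the parameter at the same index:

import torch

def report_shape_mismatches(module, ckpt_path="./trex_model_300000.pth"):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    opt_state = ckpt["optimizer"]["state"]   # Adam state, keyed by parameter index
    params = list(module.parameters())        # the order Adam enumerates them in
    for idx, p in enumerate(params):
        st = opt_state.get(idx)
        if st is not None and st["exp_avg"].shape != p.shape:
            print(f"param {idx}: exp_avg {tuple(st['exp_avg'].shape)} "
                  f"vs model {tuple(p.shape)}")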

Digging further, I realized the problem is that the layer ordering in the pre-trained optimizer state differs from that of the freshly initialized GNT model, although I am not sure whether this explanation really applies.
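
For context, torch.optim.Adam stores its per-parameter state keyed by the parameter's position in the param group, not by its name, so any change in the order (or set) of registered parameters shifts that alignment. A small toy example, unrelated to GNT itself:

import torch
import torch.nn as nn

# Toy model, not GNT: two linear layers give four parameters (weights + biases).
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

model(torch.randn(1, 4)).sum().backward()
opt.step()

# The optimizer state is keyed by integer position, not by parameter name,
# so a reordered or renamed parameter list makes index i point at a different tensor.
print(list(opt.state_dict()["state"].keys()))  # [0, 1, 2, 3]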

I don't know what caused this mismatch in ordering. I hope you can help me.

MukundVarmaT commented 1 year ago

Hi @Raspberrycai1, thank you for your interest in our work, and apologies for the delay in our response!

The released checkpoints were trained with a private codebase, which I cleaned up for public release. During the cleanup, I renamed a layer for better readability, but this led to a naming mismatch with the checkpoints. I have fixed this and can confirm that it works on my end. Please let me know in case you run into further trouble!
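
For anyone stuck on an older commit, one possible workaround (a sketch only; the checkpoint key and layer names below are placeholders, not the actual GNT names, and pulling the fixed code is the cleaner route) is to rename the affected keys in the checkpoint before loading:

import torch

ckpt = torch.load("./trex_model_300000.pth", map_location="cpu")
# "network" is a placeholder for whichever checkpoint entry holds the model
# weights; "old_layer." / "new_layer." stand in for the renamed module prefix.
weights = ckpt["network"]
ckpt["network"] = {k.replace("old_layer.", "new_layer."): v for k, v in weights.items()}
torch.save(ckpt, "./trex_model_300000_renamed.pth")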

Thanks

zichen34 commented 1 year ago

Yes, it works. Thanks a lot for your effort.