Taikakim opened 4 months ago
OK, I'm now getting the same error even when passing an unwrapped version of that trained model with the --pretrained-ckpt-path option. This is curious, since the only thing I changed in the model config was the LR.
OK, I think I found the reason: when I reverted the sample prompts to what they were in the initial run, training resumed. This is quite unexpected, though.
I was getting the same error, and this issue helped me figure out what was wrong.
You can't change the demo prompts from the ones used in the initial run. Once I reverted the prompts in the demo section of the model_config, demo generation and training started successfully.
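Because the workaround was to revert the demo prompts, a quick way to catch this before resuming is to compare the demo prompts in the original and the new model config. This is only a hypothetical sketch: it assumes the prompts sit under `training` → `demo` → `demo_cond` → `prompt` in the model config JSON (check that against your own config), and the helper names are made up for illustration. It flags the config difference; it does not explain why the demo step fails. The 14 vs 12 in the error below would be consistent with two prompts having been added to an original set of 12, but that is a guess.

```python
import json
from typing import Any


def demo_prompts(config: dict[str, Any]) -> list[str]:
    """Return the demo prompts from a model config dict.

    Assumption: prompts live at config["training"]["demo"]["demo_cond"][i]["prompt"].
    Missing keys are treated as "no demo prompts".
    """
    demo = config.get("training", {}).get("demo", {})
    return [cond.get("prompt", "") for cond in demo.get("demo_cond", [])]


def check_demo_prompts_unchanged(original: dict[str, Any], resumed: dict[str, Any]) -> list[str]:
    """Return human-readable differences between the two configs' demo prompts (empty if none)."""
    old, new = demo_prompts(original), demo_prompts(resumed)
    problems = []
    if len(old) != len(new):
        problems.append(f"demo prompt count changed: {len(old)} -> {len(new)}")
    # Compare the prompts that exist in both configs, position by position.
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            problems.append(f"demo prompt {i} changed: {a!r} -> {b!r}")
    return problems


def load_config(path: str) -> dict[str, Any]:
    """Load a model config JSON file (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    # Illustrative configs: 12 prompts originally, 2 added for the resumed run.
    original = {"training": {"demo": {"demo_cond": [{"prompt": f"prompt {i}"} for i in range(12)]}}}
    resumed = {"training": {"demo": {"demo_cond": [{"prompt": f"prompt {i}"} for i in range(14)]}}}
    for problem in check_demo_prompts_unchanged(original, resumed):
        print(problem)  # prints: demo prompt count changed: 12 -> 14
```

With real files you would call `check_demo_prompts_unchanged(load_config(old_path), load_config(new_path))` and, if it returns anything, restore the original prompts before passing the checkpoint back in.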
Hi, if I understood correctly, --ckpt-path is the right way to pass the weights when continuing from the 16GB checkpoints. I tried resuming directly after training the base model for some hours. I only changed the LR, added two sample prompts, and changed the warmup speed, but got this error:
```
The size of tensor a (14) must match the size of tensor b (12) at non-singleton dimension 0
```