NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

How to resume my previous training? #123

Open pierreparfait01 opened 11 months ago

pierreparfait01 commented 11 months ago

From the documentation I supposed that I should add --resume before rerunning my training, but after I start it, it just says it's training from scratch.

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar --resume

How exactly do I resume my training? Or is it supposed to say "train from scratch"?

chenhsuanlin commented 11 months ago

Hi @pierreparfait01, has a checkpoint ever been saved? Just adding --resume should make the training resume from the latest checkpoint. Otherwise, you could specify --checkpoint={CHECKPOINT_PATH} as mentioned in the README.

pierreparfait01 commented 11 months ago

Yes, a checkpoint has been saved, but it still trains from scratch every single time. Even if I use --checkpoint={CHECKPOINT_PATH} it still trains from scratch; I verified that it does indeed train all over again.

Edit: I managed to get it working. For some reason the checkpoint wouldn't load if the flag was placed after --show_pbar. But here comes another issue (bug?): if I place --resume before --show_pbar, then --show_pbar doesn't take effect.

Jaydentlee commented 4 months ago

Hi, I encountered the same problem as yours and figured out why: you need to add a \ at the end of the last line before adding a new line with --resume.
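In other words, without the trailing backslash the shell ends the command early, so any flag on the next line never reaches train.py at all, which is why training silently restarted from scratch. A minimal sketch of the behavior (the count_args helper is hypothetical, just to make the argument passing visible):

```shell
# Hypothetical helper that simply reports how many arguments it received.
count_args() { echo "$#"; }

# With a trailing backslash on every continued line, all flags reach
# one single command invocation:
count_args --logdir=logs/demo \
    --show_pbar \
    --resume
# prints 3: all three flags were passed together

# Without the backslash after --show_pbar, the shell would terminate
# the command there; --resume on the next line would be executed as a
# separate command and never be seen by the training script.
```

The same applies to the torchrun command above: every line except the last needs a trailing \ for the flags that follow it to be part of the same invocation.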