SuLvXiangXin / zipnerf-pytorch

Unofficial implementation of ZipNeRF
Apache License 2.0
802 stars 87 forks

checkpointing system bug? #78

Open hecodeit opened 1 year ago

hecodeit commented 1 year ago

Training with the following checkpoint configuration:

accelerate launch train.py \
    --gin_configs=configs/360.gin \
    --gin_bindings="Config.data_dir = '${DATA_DIR}'" \
    --gin_bindings="Config.exp_name = '${EXP_NAME}'" \
    --gin_bindings="Config.factor = 4" \
    --gin_bindings="Config.checkpoint_every = 1000"

After killing the process and restarting training, the log shows the checkpoint being loaded successfully, but training starts again from step "0/25000". Is this a bug?

2023-08-18 15:37:51: Resuming from checkpoint exp/dozer/checkpoints/001000
2023-08-18 15:37:51: Loading states from exp/dozer/checkpoints/001000
2023-08-18 15:37:51: All model weights loaded successfully
2023-08-18 15:37:52: All optimizer states loaded successfully
2023-08-18 15:37:52: All scheduler states loaded successfully
2023-08-18 15:37:52: GradScaler state loaded successfully
2023-08-18 15:37:52: All random states loaded successfully
2023-08-18 15:37:52: Loading in 0 custom states
2023-08-18 15:37:52: Number of parameters being optimized: 77622581
2023-08-18 15:37:52: Begin training...
2023-08-18 15:37:54: 1/25000:loss=0.05934,psnr=20.902,lr=3.14e-06 | data=0.05782,anti=2.2e-05,dist=0.00114,hash=0.00036,4660495 r/s
2023-08-18 15:37:54: Reducer buckets have been rebuilt in this iteration.
Training:   0%|                        | 49/25000 [00:47<6:30:51,  1.06it/s]
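
For context, the "Loading states from ..." / "All ... loaded successfully" messages come from Accelerate's accelerator.load_state, which restores the model weights, optimizer, scheduler, GradScaler, and RNG states, but not the training loop's own step counter; that value has to be recovered separately, usually from the checkpoint directory name. A minimal resume sketch under that assumption (illustrative only, not the repo's actual code):

# Illustrative sketch of the resume side, assuming checkpoints live in
# directories named after the global step (e.g. exp/dozer/checkpoints/001000).
import glob
import os
from accelerate import Accelerator

accelerator = Accelerator()
# model, optimizer, scheduler would be passed through accelerator.prepare(...) here

checkpoint_dir = "exp/dozer/checkpoints"       # assumed from the log above
max_steps = 25000

init_step = 0
checkpoints = sorted(glob.glob(os.path.join(checkpoint_dir, "[0-9]*")))
if checkpoints:
    latest = checkpoints[-1]
    accelerator.load_state(latest)             # restores weights/optimizer/scheduler/RNG
    init_step = int(os.path.basename(latest))  # step counter must be restored by hand

for step in range(init_step + 1, max_steps + 1):
    # ... one training iteration (elided) ...
    pass

If the loop resumes from init_step like this, the progress display should pick up at the saved step rather than at 0/25000 after a restart.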