Hi, I tried to resume my training from an intermediate checkpoint file with cfg.MODEL.WEIGHTS and no_resume=False, but it didn't work. The checkpointer cannot locate the checkpoint file because there are 8 files for 8 ranks (I trained my model with 8 GPUs). I made the following change in train.py to locate the checkpoint file of each rank:
# Before
start_iter = checkpointer.resume_or_load(cfg.MODEL.WEIGHTS, resume=resume).get("iteration", -1) + 1

# After
from dinov2.fsdp import rankstr

start_iter = checkpointer.resume_or_load(
    cfg.MODEL.WEIGHTS.replace("rank_0", rankstr()),  # input: model_xxxxxx.rank_0.pth
    resume=resume,
).get("iteration", -1) + 1
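The per-rank path substitution above can be sketched in isolation. This is a minimal, self-contained sketch: in dinov2.fsdp, rankstr() reads the global rank from torch.distributed, but here the rank is passed explicitly so the sketch runs without a process group.

```python
def rankstr(rank: int) -> str:
    # dinov2.fsdp derives this from the global distributed rank;
    # here it is a plain parameter for illustration.
    return f"rank_{rank}"

def per_rank_checkpoint(path: str, rank: int) -> str:
    # cfg.MODEL.WEIGHTS points at the rank-0 shard,
    # e.g. output/vitb16/model_0116249.rank_0.pth;
    # each rank substitutes its own id to locate its shard.
    return path.replace("rank_0", rankstr(rank))

print(per_rank_checkpoint("output/vitb16/model_0116249.rank_0.pth", 3))
# → output/vitb16/model_0116249.rank_3.pth
```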
Then I can resume training with the following command:
However, I got some warning messages telling me that the checkpoint contains some weights that are not used by the model:
I20240702 20:40:12 1728745 fvcore.common.checkpoint checkpoint.py:150] [Checkpointer] Loading from output/vitb16/model_0116249.rank_0.pth ...
W20240702 20:40:14 1728745 fvcore.common.checkpoint checkpoint.py:352] The checkpoint state_dict contains keys that are not used by the model:
student.backbone._fsdp_wrapped_module._flat_param
student.backbone._fsdp_wrapped_module.blocks.0._fsdp_wrapped_module._flat_param
student.backbone._fsdp_wrapped_module.blocks.1._fsdp_wrapped_module._flat_param
student.backbone._fsdp_wrapped_module.blocks.2._fsdp_wrapped_module._flat_param
student.backbone._fsdp_wrapped_module.blocks.3._fsdp_wrapped_module._flat_param
student.dino_head._fsdp_wrapped_module._flat_param
teacher.backbone._fsdp_wrapped_module._flat_param
teacher.backbone._fsdp_wrapped_module.blocks.0._fsdp_wrapped_module._flat_param
teacher.backbone._fsdp_wrapped_module.blocks.1._fsdp_wrapped_module._flat_param
teacher.backbone._fsdp_wrapped_module.blocks.2._fsdp_wrapped_module._flat_param
teacher.backbone._fsdp_wrapped_module.blocks.3._fsdp_wrapped_module._flat_param
teacher.dino_head._fsdp_wrapped_module._flat_param
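For context, a sketch of where this warning comes from (my own illustration, not dinov2 code, and assuming fvcore's Checkpointer reports checkpoint keys absent from the model's state_dict): the `_flat_param` entries are FSDP's flattened per-module parameter buffers saved in the checkpoint, and the warning is essentially a key-set diff like this:

```python
def unused_checkpoint_keys(checkpoint_keys, model_keys):
    # Keys present in the checkpoint but not in the live model's
    # state_dict; these are what the fvcore warning lists.
    return sorted(set(checkpoint_keys) - set(model_keys))

ckpt_keys = {
    "student.backbone._fsdp_wrapped_module._flat_param",
    "student.backbone._fsdp_wrapped_module.blocks.0.norm1.weight",
}
model_keys = {
    "student.backbone._fsdp_wrapped_module.blocks.0.norm1.weight",
}
print(unused_checkpoint_keys(ckpt_keys, model_keys))
# → ['student.backbone._fsdp_wrapped_module._flat_param']
```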
Would these mismatched weights affect the training? I just want to make sure about that, or can I simply ignore the warning? Thanks.