facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Resume training from intermediate checkpoint? #436

Open JiarunLiu opened 3 months ago

JiarunLiu commented 3 months ago

Hi, I tried to resume my training from an intermediate checkpoint file with cfg.MODEL.WEIGHTS & no_resume=False, but it didn't work. The checkpointer cannot locate the checkpoint file because there are 8 files for 8 ranks (I trained my model on 8 GPUs). I made the following change in train.py to locate the checkpoint file for each rank:

# Before
start_iter = checkpointer.resume_or_load(cfg.MODEL.WEIGHTS, resume=resume).get("iteration", -1) + 1

# After
from dinov2.fsdp import rankstr
start_iter = checkpointer.resume_or_load(
    cfg.MODEL.WEIGHTS.replace("rank_0", rankstr()),  # input: model_xxxxxx.rank_0.pth
    resume=resume,  # resume is keyword-only in fvcore's Checkpointer
).get("iteration", -1) + 1
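The substitution above can be illustrated standalone. The real `rankstr()` in `dinov2.fsdp` derives the rank from the distributed environment; in this minimal sketch the rank is passed in explicitly so it runs without `torch.distributed`:

```python
# Minimal sketch of the per-rank checkpoint path substitution.
# dinov2.fsdp.rankstr() returns "rank_<global_rank>"; here the rank is an
# explicit argument so the helper works outside a distributed job.
def rank_checkpoint_path(path: str, rank: int) -> str:
    """Rewrite a rank_0 checkpoint path to the path for a given rank."""
    return path.replace("rank_0", f"rank_{rank}")

# Each of the 8 ranks resolves its own shard of the same iteration:
print(rank_checkpoint_path("output/vitb16_old/model_0116249.rank_0.pth", 3))
# output/vitb16_old/model_0116249.rank_3.pth
```

This works because DINOv2's sharded checkpoints only differ in the `rank_N` component of the filename.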

Then I can resume training with the following command:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 && export PYTHONPATH=. && python -m torch.distributed.launch --nproc_per_node=8 --master_port=12366 dinov2/train/train.py \
    --config-file=dinov2/configs/train/vitb16.yaml \
    --output-dir output/vitb16 \
    train.dataset_path=MyDataSet:root=/path/to/my/dataset:split=train \
    train.batch_size_per_gpu=256 \
    MODEL.WEIGHTS=output/vitb16_old/model_0116249.rank_0.pth \
    no_resume=False

However, I got warning messages telling me that the checkpoint contains some keys that are not used by the model:

I20240702 20:40:12 1728745 fvcore.common.checkpoint checkpoint.py:150] [Checkpointer] Loading from output/vitb16/model_0116249.rank_0.pth ...
W20240702 20:40:14 1728745 fvcore.common.checkpoint checkpoint.py:352] The checkpoint state_dict contains keys that are not used by the model:
  student.backbone._fsdp_wrapped_module._flat_param
  student.backbone._fsdp_wrapped_module.blocks.0._fsdp_wrapped_module._flat_param
  student.backbone._fsdp_wrapped_module.blocks.1._fsdp_wrapped_module._flat_param
  student.backbone._fsdp_wrapped_module.blocks.2._fsdp_wrapped_module._flat_param
  student.backbone._fsdp_wrapped_module.blocks.3._fsdp_wrapped_module._flat_param
  student.dino_head._fsdp_wrapped_module._flat_param
  teacher.backbone._fsdp_wrapped_module._flat_param
  teacher.backbone._fsdp_wrapped_module.blocks.0._fsdp_wrapped_module._flat_param
  teacher.backbone._fsdp_wrapped_module.blocks.1._fsdp_wrapped_module._flat_param
  teacher.backbone._fsdp_wrapped_module.blocks.2._fsdp_wrapped_module._flat_param
  teacher.backbone._fsdp_wrapped_module.blocks.3._fsdp_wrapped_module._flat_param
  teacher.dino_head._fsdp_wrapped_module._flat_param

Would these mismatched keys affect the training? I just want to make sure about that. Or can I simply ignore them? Thanks.
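For context on the warning above: the `_flat_param` entries are the flattened parameter buffers that PyTorch FSDP stores per wrapped module, not regular model weights. Whether they are safe to drop depends on how the checkpoint was saved, but a hypothetical sketch of inspecting and stripping them from a loaded state_dict (key names taken from the log above) looks like this:

```python
# Hypothetical sketch: strip FSDP's internal _flat_param buffers from a
# checkpoint state_dict, e.g. before loading it into a model that is not
# wrapped the same way. Key names follow the warning log above.
def strip_flat_params(state_dict: dict) -> dict:
    """Drop keys ending in '_flat_param' (FSDP's flattened buffers)."""
    return {k: v for k, v in state_dict.items()
            if not k.endswith("_flat_param")}

sd = {
    "student.backbone._fsdp_wrapped_module._flat_param": None,
    "student.backbone._fsdp_wrapped_module.blocks.0.norm1.weight": None,
}
print(sorted(strip_flat_params(sd)))
# ['student.backbone._fsdp_wrapped_module.blocks.0.norm1.weight']
```

Note this only illustrates which keys the warning refers to; it is not a statement that dropping them is correct for resuming DINOv2 training.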

rickmorty321 commented 2 months ago

I am using something very similar to resume the training, but I don't get any size mismatch.