I also tried to force the self.iter value here, but without success (at least on 4 GPUs; locally on my computer with a single GPU it seems to accept the hardcoded value).
Thanks for pointing this out.
This is because Detectron2 updated its resume_or_load function, while ours is based on a slightly older version.
You could temporarily try this:
Add this function to the Trainer (such as UBTeacherTrainer).
```python
def resume_or_load(self, resume=True):
    """
    If `resume==True` and `cfg.OUTPUT_DIR` contains the last checkpoint (defined by
    a `last_checkpoint` file), resume from the file. Resuming means loading all
    available states (eg. optimizer and scheduler) and update iteration counter
    from the checkpoint. ``cfg.MODEL.WEIGHTS`` will not be used.
    Otherwise, this is considered as an independent training. The method will load model
    weights from the file `cfg.MODEL.WEIGHTS` (but will not load other states) and start
    from iteration 0.
    Args:
        resume (bool): whether to do resume or not
    """
    checkpoint = self.checkpointer.resume_or_load(
        self.cfg.MODEL.WEIGHTS, resume=resume
    )
    if resume and self.checkpointer.has_checkpoint():
        self.start_iter = checkpoint.get("iteration", -1) + 1
        # The checkpoint stores the training iteration that just finished, thus we start
        # at the next iteration (or iter zero if there's no checkpoint).
    if isinstance(self.model, DistributedDataParallel):
        # broadcast loaded data/model from the first rank, because other
        # machines may not have access to the checkpoint file
        if TORCH_VERSION >= (1, 7):
            self.model._sync_params_and_buffers()
        self.start_iter = comm.all_gather(self.start_iter)[0]
```
Also add `from detectron2.utils.env import TORCH_VERSION` to trainer.py.
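For reference, a minimal sketch of the imports the patched function relies on, assuming they sit near the top of trainer.py (everything except the TORCH_VERSION line may already be imported there):

```python
# Imports used by the resume_or_load above; only the TORCH_VERSION line
# is newly required, the others are assumed to already exist in trainer.py.
from torch.nn.parallel import DistributedDataParallel

import detectron2.utils.comm as comm
from detectron2.utils.env import TORCH_VERSION  # newly added import
```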
Let me know if this works. If not, I will come back and fix this after next week (sorry, I might be very busy these two weeks). Thanks!
Hi @ycliu93, I will try it and let you know if it works. Thanks a lot for your help!
Hi @ycliu93,
Do I need to do something special to correctly restart the training? Is it enough to have the --resume flag, the model .pth file, and the last_checkpoint file inside the output directory? Do I need any other files?
Yes, what you mentioned is all you need to resume from a trained weight.
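If you want to double-check that the output directory is set up correctly before relaunching, here is a minimal sketch (the ./output path is a placeholder, and nn.Identity() is just a stand-in model; DetectionCheckpointer and its has_checkpoint/get_checkpoint_file helpers come from Detectron2):

```python
import torch.nn as nn
from detectron2.checkpoint import DetectionCheckpointer

# Placeholder output directory; point this at your real OUTPUT_DIR.
output_dir = "./output"

# Any nn.Module works for this check; the checkpointer only needs a model to attach to.
checkpointer = DetectionCheckpointer(nn.Identity(), save_dir=output_dir)

# has_checkpoint() is True only if "<output_dir>/last_checkpoint" exists;
# get_checkpoint_file() reads that file and returns the path of the .pth it names.
if checkpointer.has_checkpoint():
    print("will resume from:", checkpointer.get_checkpoint_file())
else:
    print("no last_checkpoint file found; training would start from MODEL.WEIGHTS")
```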
Hi everybody, I tried to run a training on a machine with 4 GPUs and had to stop it before the end. When I tried to resume the training with:
It looks like it loads the last model:
But then, the iter value seems to restart from zero. Is this normal? Because e.g. the BURN_UP_STEP configuration is probably affected by this. Does anybody have the same issue?
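One way to check whether the checkpoint itself carries an iteration counter (the .pth name below is just a placeholder for whatever your last_checkpoint file points to):

```python
import torch

# Placeholder checkpoint path; substitute the .pth named in your last_checkpoint file.
ckpt = torch.load("output/model_0019999.pth", map_location="cpu")

# Checkpoints written during training store the just-finished iteration under the
# "iteration" key (the same key resume_or_load reads); resuming should start at this + 1.
print(ckpt.get("iteration", "no iteration key found"))
```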