I also tried to force the self.iter value here, but without success (at least on 4 GPUs; locally on my computer with a single GPU it seems to accept the hardcoded value).
Thanks for pointing this out.
This is because Detectron2 updated its resume_or_load function, while ours is based on a slightly older version.
You could temporarily try this:
Add this function to the Trainer (such as UBTeacherTrainer).
```python
def resume_or_load(self, resume=True):
    """
    If `resume==True` and `cfg.OUTPUT_DIR` contains the last checkpoint (defined by
    a `last_checkpoint` file), resume from the file. Resuming means loading all
    available states (eg. optimizer and scheduler) and update iteration counter
    from the checkpoint. ``cfg.MODEL.WEIGHTS`` will not be used.
    Otherwise, this is considered as an independent training. The method will load model
    weights from the file `cfg.MODEL.WEIGHTS` (but will not load other states) and start
    from iteration 0.
    Args:
        resume (bool): whether to do resume or not
    """
    checkpoint = self.checkpointer.resume_or_load(
        self.cfg.MODEL.WEIGHTS, resume=resume
    )
    if resume and self.checkpointer.has_checkpoint():
        self.start_iter = checkpoint.get("iteration", -1) + 1
        # The checkpoint stores the training iteration that just finished, thus we start
        # at the next iteration (or iter zero if there's no checkpoint).
    if isinstance(self.model, DistributedDataParallel):
        # broadcast loaded data/model from the first rank, because other
        # machines may not have access to the checkpoint file
        if TORCH_VERSION >= (1, 7):
            self.model._sync_params_and_buffers()
        self.start_iter = comm.all_gather(self.start_iter)[0]
```
Also add `from detectron2.utils.env import TORCH_VERSION` to trainer.py.
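For reference, a minimal sketch of the imports the patched function relies on, assuming they sit near the top of trainer.py (everything except the TORCH_VERSION line may already be imported there):

```python
# Imports used by the resume_or_load above; only the TORCH_VERSION line
# is newly required, the others are assumed to already exist in trainer.py.
from torch.nn.parallel import DistributedDataParallel

import detectron2.utils.comm as comm
from detectron2.utils.env import TORCH_VERSION  # newly added import
```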
Let me know if this works. If not, I will come back and fix this after next week (sorry, I might be very busy these two weeks). Thanks!
Hi @ycliu93, I will try it and let you know if it works. Thanks a lot for your help!
Hi @ycliu93,
Do I need to do something special to correctly restart the training? Is it enough to have the --resume flag, the model .pth file, and the last_checkpoint file inside the output directory? Do I need any other files?
Yes, what you mentioned is all you need to resume from a trained weight.
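If you want to double-check that the output directory is set up correctly before relaunching, here is a minimal sketch (the ./output path is a placeholder, and nn.Identity() is just a stand-in model; DetectionCheckpointer and its has_checkpoint/get_checkpoint_file helpers come from Detectron2):

```python
import torch.nn as nn
from detectron2.checkpoint import DetectionCheckpointer

# Placeholder output directory; point this at your real OUTPUT_DIR.
output_dir = "./output"

# Any nn.Module works for this check; the checkpointer only needs a model to attach to.
checkpointer = DetectionCheckpointer(nn.Identity(), save_dir=output_dir)

# has_checkpoint() is True only if "<output_dir>/last_checkpoint" exists;
# get_checkpoint_file() reads that file and returns the path of the .pth it names.
if checkpointer.has_checkpoint():
    print("will resume from:", checkpointer.get_checkpoint_file())
else:
    print("no last_checkpoint file found; training would start from MODEL.WEIGHTS")
```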
Hi everybody, I tried to run a training on a machine with 4 GPUs and had to stop it before the end. When I tried to resume the training with:
It looks like it loads the last model:
But then, the iter value seems to restart from zero. Is this normal? Because e.g. the BURN_UP_STEP configuration is probably affected by this. Does anybody have the same issue?
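One way to check whether the checkpoint itself carries an iteration counter (the .pth name below is just a placeholder for whatever your last_checkpoint file points to):

```python
import torch

# Placeholder checkpoint path; substitute the .pth named in your last_checkpoint file.
ckpt = torch.load("output/model_0019999.pth", map_location="cpu")

# Checkpoints written during training store the just-finished iteration under the
# "iteration" key (the same key resume_or_load reads); resuming should start at this + 1.
print(ckpt.get("iteration", "no iteration key found"))
```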