How do I resume training after an unexpected interruption in training?

liyunlongaaa commented 2 years ago

Hi friend, I'm new to this. For some reason I can only train on a laptop, but halfway through training the computer restarts because it overheats. How should I continue training? Thank you for your help!

YuanGongND commented 2 years ago

Hi there,

Did the machine finish the first epoch? If so, you should be able to find the saved checkpoint in the experiment path. In addition, when you train with a large dataset (with more than 200k samples), the script also saves the optimizer states. https://github.com/YuanGongND/ast/blob/87a80043154eb4bb34ebceb4dc3e2d91a99235f4/src/traintest.py#L210-L216

The training progress is also saved at https://github.com/YuanGongND/ast/blob/87a80043154eb4bb34ebceb4dc3e2d91a99235f4/src/traintest.py#L39-L43

You should be able to use the files above with torch.load and then torch.nn.DataParallel to load the model and continue training, but this repo does not have an interface for resuming training.
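As a rough illustration of the torch.load plus DataParallel approach, here is a minimal sketch. It is not code from this repo: the function name `resume_from_checkpoint` and its arguments are hypothetical, and it assumes the checkpoint files hold plain `state_dict()` objects saved from a model wrapped in torch.nn.DataParallel (so the parameter names start with "module.").

```python
import torch
import torch.nn as nn


def resume_from_checkpoint(model, optimizer, model_path, optim_path=None):
    """Restore model (and optionally optimizer) state from an earlier run.

    Hypothetical helper, not part of the AST repo.
    """
    # Assumption: the weights were saved from an nn.DataParallel model,
    # so the keys carry a "module." prefix. Wrapping the fresh model first
    # makes the key names line up.
    if not isinstance(model, nn.DataParallel):
        model = nn.DataParallel(model)

    # map_location="cpu" lets the file load even without the original GPU;
    # load_state_dict then copies values onto the model's own device.
    model.load_state_dict(torch.load(model_path, map_location="cpu"))

    # The optimizer state is optional because it may not have been saved.
    if optim_path is not None:
        optimizer.load_state_dict(torch.load(optim_path, map_location="cpu"))

    return model, optimizer
```

You would build the model and optimizer exactly as for a fresh run, call this helper with the paths of the checkpoint files found in your experiment directory, and then start the training loop from the epoch after the last saved one.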

To train with lower computational overhead, you could consider (1) fine-tuning our AudioSet-pretrained model on your dataset (please check the ESC-50 recipe), and/or (2) using a smaller overlap, or no overlap, in the patch split, e.g., setting fstride=16 and tstride=16 when you instantiate the AST model.
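To see why the strides matter, the sketch below counts the patches the model has to process. The helper `num_patches` is hypothetical, and it assumes 16x16 patches, a 128-bin by 1024-frame input, and default strides of 10, which you should check against your own configuration.

```python
def num_patches(input_fdim=128, input_tdim=1024, fstride=10, tstride=10, patch=16):
    """Count patches produced by a strided 16x16 split (hypothetical helper)."""
    # Same arithmetic as an unpadded convolution's output size:
    # positions = (input - kernel) // stride + 1 along each axis.
    f_dim = (input_fdim - patch) // fstride + 1
    t_dim = (input_tdim - patch) // tstride + 1
    return f_dim * t_dim


overlapping = num_patches(fstride=10, tstride=10)   # overlap of 6 on each axis
non_overlap = num_patches(fstride=16, tstride=16)   # no overlap
print(overlapping, non_overlap)  # prints: 1212 512
```

Under these assumptions the sequence is about 2.4 times shorter without overlap, and since self-attention cost grows roughly with the square of the sequence length, that part of the computation shrinks by a larger factor still.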

-Yuan

liyunlongaaa commented 2 years ago

Wow, thank you very much!! Love you
