Closed by ruslankotl 1 year ago
Hi, by default the model saves checkpoints automatically under the output directory (--output_dir). For example, you may see:
-- model/
| -- checkpoint-10000/
| -- runs/
Here the checkpoint-10000/ folder contains all the files needed to resume training (for example, if training was accidentally interrupted). To resume, all you need to do is pass its path to the --pretrain argument. For example:
t5chem train --data_dir data/sample/product/ --output_dir model/ --pretrain model/checkpoint-10000/ --task_type product --num_epoch 30
Then the training will be resumed from the 10000th step. Hope it helps!
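If you restart runs often, a small helper can pick the newest checkpoint for you. This is only a rough sketch, not part of T5Chem: `latest_checkpoint` and `resume_command` are hypothetical names, and it assumes checkpoint folders are always named `checkpoint-<step>` as in the listing above. It uses only the flags shown in the command above.

```python
import re
from pathlib import Path
from typing import List, Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subfolder with the highest step, or None.

    Assumes the checkpoint-<step> naming shown in the listing above.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"checkpoint-(\d+)")
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = pattern.fullmatch(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


def resume_command(data_dir: str, output_dir: str,
                   task_type: str, num_epoch: int) -> List[str]:
    """Build the t5chem train command, adding --pretrain if a checkpoint exists."""
    cmd = ["t5chem", "train",
           "--data_dir", data_dir,
           "--output_dir", output_dir,
           "--task_type", task_type,
           "--num_epoch", str(num_epoch)]
    ckpt = latest_checkpoint(output_dir)
    if ckpt is not None:
        # Only pass --pretrain when there is something to resume from.
        cmd += ["--pretrain", str(ckpt)]
    return cmd
```

You could then run it with something like `subprocess.run(resume_command("data/sample/product/", "model/", "product", 30), check=True)`. Comparing the step numbers as integers matters here, since a plain string sort would put checkpoint-9000 after checkpoint-10000.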
Hi,
I was wondering if the T5Chem CLI had a provision to resume training from a checkpoint?