bytedance / HLLM

HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling
Apache License 2.0

Can I resume run from checkpoint? My cluster has max run time of 48 hours and resuming from checkpoint is only option to finish #23

Closed lixali closed 4 hours ago

lixali commented 4 hours ago

Is there a script to resume training from a checkpoint? The weights are stored in "pt" files. Are the optimizer states saved somewhere?

It would be very helpful if the code supported resuming from a checkpoint.
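For reference, a generic PyTorch pattern for writing a resumable checkpoint is to save the optimizer state and step counter alongside the model weights. This is a standard sketch, not code from the HLLM repo; the filename and step value are placeholders.

```python
# Generic resumable-checkpoint pattern (hypothetical, not HLLM-specific).
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Save model weights, optimizer state, and training progress together.
ckpt = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": 1000,  # placeholder global step
}
torch.save(ckpt, "resumable.pt")

# On restart (e.g. after hitting the 48h cluster limit):
ckpt = torch.load("resumable.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt["step"]  # continue the loop from here
```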

ssyzeChen commented 4 hours ago

Sorry, the pretrained weights only contain the model parameters; we do not save optimizer states after training 😢. You could try loading just the weights and continuing to finetune the model.