bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

Checkpointing: Continue model training at epoch x after saving intermediate model #36

Open lukashaenjes opened 3 years ago

lukashaenjes commented 3 years ago

Hi, first of all, many thanks for this outstanding package.

I have a question concerning model checkpointing: I have a fairly large corpus (~70M words) and train a model that computes word embeddings (with embed_wordspace) for 10 epochs. I run this on a remote server, and it can take up to 2 days for all 10 epochs to finish.

As a fault-tolerance measure, I figured it might be a good idea to checkpoint the model after every epoch so that, if something crashes, I can load the last saved epoch and continue training from there. For this, I set saveEveryEpoch = TRUE. Since I only want to keep the last successful epoch, I leave saveTempModel = FALSE.

My question now is: How can I continue training from this checkpoint after something went wrong? I tried to pass initModel = "wordspace.bin" in the existing embed_wordspace call, which gives:

Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.

But it then continues to run the model with the parameters specified in the overall call to embed_wordspace, starting at epoch 1 and seemingly ignoring the passed model. Also, when I read in the intermediate wordspace.bin.tsv, I'm left with the default parameters, not the ones I passed to the function. For instance, x$args$param$epoch gives 5 (the default), although I originally passed epoch = 10:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5

Could this be the cause of the problem?

Am I approaching this correctly? What would be an alternative way to achieve my desired goal? I'm thinking of something similar to the ModelCheckpoint functionality in TensorFlow.

Many thanks in advance!

jwijffels commented 3 years ago

I have never done this, but I think you can just set saveEveryEpoch = TRUE. The next time you want to train, load the model and extract the embeddings:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
embeddings <- as.matrix(x)

Then pass the embeddings to embed_wordspace(..., embeddings = embeddings), or directly to starspace(..., embeddings = embeddings). Transfer learning is shown in section 5 of the package vignette: https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf
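Put together, the suggestion above might look like the following minimal sketch. It has not been verified against this exact setup. The helper resume_wordspace, its arguments epochs_left and dim, and the output file name wordspace_resumed.bin are hypothetical names introduced for illustration; only starspace_load_model, as.matrix and embed_wordspace come from the package. It assumes that the checkpoint was written as a TSV file, that dim and epoch are passed through to StarSpace, and that the embedding width of the checkpoint matches dim.

```r
library(ruimtehol)

# Hypothetical helper (not part of ruimtehol): start a new training run that
# is seeded with the embeddings of a previously saved TSV checkpoint.
resume_wordspace <- function(x, checkpoint, epochs_left, dim, ...) {
  # Load the checkpoint written by the interrupted run
  previous <- starspace_load_model(checkpoint, method = "tsv-data.table")
  # One row per dictionary term, one column per embedding dimension
  embeddings <- as.matrix(previous)
  # The new run must use the same embedding width as the checkpoint
  stopifnot(ncol(embeddings) == dim)
  # Train only the epochs that are still missing, starting from the
  # checkpointed embeddings instead of a random initialisation
  embed_wordspace(x,
                  model = "wordspace_resumed.bin",  # assumed output file name
                  dim = dim,
                  epoch = epochs_left,
                  embeddings = embeddings,
                  ...)
}

# Example call (assumes `x` is the original character vector of text and
# that 6 of the 10 epochs had finished before the crash):
# model <- resume_wordspace(x, "wordspace.bin.tsv", epochs_left = 4, dim = 100)
```

The number of completed epochs has to be tracked by the caller, since the parameters stored in the TSV checkpoint may not reflect the original call. A run resumed this way starts a fresh training schedule, so it is not guaranteed to match an uninterrupted 10-epoch run exactly.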

lukashaenjes commented 3 years ago

Thanks a lot for your fast response! I'll give this a try.