Closed nmanca closed 3 years ago
Can you provide a reproducible example?
library(ruimtehol)
data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text,
                       FUN = function(x) paste(x, collapse = " "))
idx <- sample(nrow(dekamer), size = round(nrow(dekamer) * 0.9))
traindata <- dekamer[idx, ]
testdata <- dekamer[-idx, ]
set.seed(123456789)
model <- embed_tagspace(x = traindata$text,
                        y = traindata$question_theme_main,
                        early_stopping = 0.8,
                        dim = 10, minCount = 5,
                        loss = 'hinge',
                        similarity = 'dot',
                        negSearchLimit = 50,
                        ngrams = 4  ## removing this line should produce consistent results
                        )
pred1 = predict(model, testdata[1,])
starspace_save_model(model, 'prova_dekamer')
prova_mod_load = starspace_load_model('prova_dekamer')
pred_1_load = predict(prova_mod_load, testdata[1,])
Thanks for the report! Indeed unexpected. I think I just shot myself in the foot with this as well. Will look into it.
TODO
Probably the buckets: https://github.com/bnosac/ruimtehol/blob/aeea341b0006fbba4101cefb5f43031a304dd10e/src/Starspace/src/model.cpp#L82 We never update the buckets when reloading the embeddings from file: https://github.com/bnosac/ruimtehol/blob/master/src/rcpp_textspace.cpp#L323
Looks like I forgot to save the embeddings of the 2000000 hashed buckets
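For context, the hashing trick can be illustrated in a few lines of base R. This is a toy hash for illustration only, not StarSpace's actual hash function; the point is that each n-gram deterministically maps to one of a fixed number of buckets, and the bucket's embedding row is what gets trained and what must therefore be saved.

```r
## Toy illustration of the hashing trick (NOT StarSpace's real hash):
## an arbitrarily large set of n-grams is mapped onto a fixed number of
## buckets; the embedding of the bucket stands in for the n-gram.
hash_ngram <- function(ngram, nbuckets = 2000000L) {
  h <- 0
  for (code in utf8ToInt(ngram)) {
    h <- (h * 31 + code) %% nbuckets
  }
  h + 1L  # 1-based bucket index
}
hash_ngram("cijfers verkrachting")  # same n-gram always lands in the same bucket
hash_ngram("verkrachting tussen")
```

If the bucket embeddings are dropped on save, the reloaded model still hashes n-grams to the same indices, but the rows at those indices no longer hold the trained values.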
Yes, this is indeed the reason: I forgot to save the hashed buckets when saving with type ruimtehol. If you save with type binary, these buckets are still there, as the LHS and RHS embeddings are saved https://github.com/bnosac/ruimtehol/blob/aeea341b0006fbba4101cefb5f43031a304dd10e/src/Starspace/src/model.cpp#L838 and everything works fine then.
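Until the fix lands, saving with the binary format is a workaround, since that path writes the full LHS/RHS embedding matrices, bucket rows included. A sketch continuing the session above, mirroring the calls shown later in this thread:

```r
## Workaround sketch: the binary StarSpace format stores the full LHS/RHS
## embedding matrices (bucket rows included), so reloading is lossless.
## Assumes `model` from the reproducible example above is still in scope.
starspace_save_model(model, "prova_dekamer", method = "binary")
## reload the binary file via the internal loader, as used below:
model_reloaded <- ruimtehol:::textspace_load_model("prova_dekamer", is_tsv = FALSE)
class(model_reloaded) <- "textspace"
```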
library(ruimtehol)
data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, FUN = function(x) paste(x, collapse = " "))
set.seed(123456789)
idx <- sample(nrow(dekamer), size = round(nrow(dekamer) * 0.9))
traindata <- dekamer[idx, ]
testdata <- dekamer[-idx, ]
set.seed(123456789)
model <- embed_tagspace(x = traindata$text,
                        y = traindata$question_theme_main,
                        early_stopping = 0.8,
                        dim = 10, minCount = 5,
                        loss = 'hinge',
                        negSearchLimit = 50,
                        ngrams = 3, bucket = 10)
> ## no save
> txt <- "Cijfers verkrachting tussen partners In"
> predict(model, txt, type = "embedding")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
1 0.3062341 -0.129529 0.375498 0.0009290021 -0.2719682 -0.05267627 0.4873475 -0.4883637 0.07587066 0.4358196
> ## save & reload with type ruimtehol
> starspace_save_model(model, 'prova_dekamer', method = "ruimtehol")
> set.seed(123456789)
> prova_mod_load = starspace_load_model('prova_dekamer', method = "ruimtehol")
> predict(prova_mod_load, txt, type = "embedding")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
1 0.2382428 -0.3666292 0.3345218 -0.0092996 -0.3742568 -0.1522302 0.3492487 -0.4678786 0.1254112 0.4206861
> ## save & reload with type binary
> starspace_save_model(model, 'prova_dekamer', method = "binary")
Saving model to file : prova_dekamer
> prova_mod_load = ruimtehol:::textspace_load_model('prova_dekamer', is_tsv = FALSE)
Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.
> class(prova_mod_load) <- "textspace"
> predict(prova_mod_load, txt, type = "embedding")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
1 0.3062338 -0.1295296 0.375498 0.0009292171 -0.2719682 -0.05267647 0.4873469 -0.4883641 0.07587058 0.4358199
Just a note to myself AAAAAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRRGGGGGGGGHHHHHHHHHHHHHHHH
Hello @nmanca
I've fixed this in commit https://github.com/bnosac/ruimtehol/commit/96f02f4f5c2dea1d155770e88d114f3c3f694514, so that saving a model trained with ngrams > 1 using the default method now works as expected.
The embeddings of the buckets (StarSpace uses the hashing trick to map the n-grams, which can be quite numerous, onto a limited set of hashes) were not saved when using method = 'ruimtehol', which is the default; they were saved when using method = 'binary'. Hence when you reloaded a model saved with the default method, it fell back on small random values for the embeddings of the n-grams, meaning it focussed mainly on the unigrams. Models saved with method 'binary' were unaffected.
The default bucket argument now results in an embedding matrix of 100000 x dim instead of 2000000 x dim.
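After updating, the fix can be checked by comparing the in-session embedding against the reloaded one. A sketch continuing the reproducible example above (it reuses the `model` and `txt` objects from that session):

```r
## Verify that a model saved with the default method = "ruimtehol"
## reproduces its in-session embeddings after reloading.
## Assumes `model` and `txt` from the example above are in scope.
emb_insession  <- predict(model, txt, type = "embedding")
starspace_save_model(model, "prova_dekamer", method = "ruimtehol")
model_reloaded <- starspace_load_model("prova_dekamer", method = "ruimtehol")
emb_reloaded   <- predict(model_reloaded, txt, type = "embedding")
all.equal(emb_insession, emb_reloaded)  # should be TRUE with the fix
```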
Please test it out.
The fix has been released on CRAN as well. Closing. Feel free to reopen if needed.
Awesome, thank you very much @jwijffels
This strange behaviour happened in my project, but I have reproduced it even with the dekamer example. Everything is fine until the model (embed_tagspace) is trained with the ngrams parameter specified, saved, and reloaded, no matter which method I use. I noticed inconsistencies in the predict results: the similarities obtained are on a different scale, and the ranking of the scored labels is different. The method used to save and load the trained model affects the predict results, but they are never consistent with those of the model object trained in the session.