embed_tagspace produces different results within a session and when loaded (starspace_load_model) if ngrams is used

nmanca commented 3 years ago

This strange behaviour happened in my project, but I have also reproduced it with the dekamer example. Everything is fine until a model trained with embed_tagspace and the ngrams parameter is saved and reloaded, whichever save method I use. I noticed the inconsistencies in the predict results: the similarities are on a different scale and the ranking of the scored labels is different. The method used to save and load the trained model does affect the predict results, but they never match those of the model object trained in the session.

jwijffels commented 3 years ago

Can you provide a reproducible example?

nmanca commented 3 years ago

library(ruimtehol)

data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

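set.seed(123456789)  # seed before sampling so the train/test split is reproducible
idx <- sample(nrow(dekamer), size = round(nrow(dekamer) * 0.9))
traindata <- dekamer[idx, ]
testdata <- dekamer[-idx, ]
set.seed(123456789)  # seed again so training starts from the same state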
model <- embed_tagspace(x = traindata$text, 
                        y = traindata$question_theme_main, 
                        early_stopping = 0.8,
                        dim = 10, minCount = 5,
                        loss = 'hinge',
                        similarity = 'dot',
                        negSearchLimit = 50,
                        ngrams = 4 ###removing this line should produce consistent results
                        )

pred1 = predict(model, testdata[1,])

starspace_save_model(model, 'prova_dekamer')
prova_mod_load = starspace_load_model('prova_dekamer')

pred_1_load = predict(prova_mod_load, testdata[1,])

jwijffels commented 3 years ago

Thanks for the report! This is indeed unexpected. I think I just shot myself in the foot with this as well. Will look into it.

jwijffels commented 3 years ago

TODO

ngrams is correctly loaded when doing starspace_load_model CHECKED: IS OK
check here https://github.com/bnosac/ruimtehol/blob/c567706e06c62ea44c20256ab80e59913ed2821f/src/rcpp_textspace.cpp#L499 if ngrams is still there CHECKED: IS OK
check here https://github.com/bnosac/ruimtehol/blob/c567706e06c62ea44c20256ab80e59913ed2821f/src/rcpp_textspace.cpp#L523 if ngrams is correctly used instead of the default 1
or maybe because of the n-gram hashing bucket

jwijffels commented 3 years ago

Probably the buckets: https://github.com/bnosac/ruimtehol/blob/aeea341b0006fbba4101cefb5f43031a304dd10e/src/Starspace/src/model.cpp#L82
We never update the buckets when reloading the embeddings from file: https://github.com/bnosac/ruimtehol/blob/master/src/rcpp_textspace.cpp#L323

jwijffels commented 3 years ago

Looks like I forgot to save the embeddings of the 2000000 hashed buckets
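
To give a feel for how much data that is, here is a rough back-of-the-envelope calculation. It assumes dim = 10 (as in the reproducible example above) and that each value is held as an 8-byte double, as in an R numeric matrix; the storage precision actually used by Starspace or by the saved file is not stated in this thread and may differ.

```r
# Rough size of the hashed-bucket embedding matrix.
# Assumption: 8 bytes per value (R double); not confirmed for the saved model.
n_buckets <- 2000000   # number of hashed buckets mentioned above
dim_emb   <- 10        # dim used in the reproducible example
n_values  <- n_buckets * dim_emb       # 20 million values
size_mb   <- n_values * 8 / 1e6        # 160 MB under the 8-byte assumption
cat(n_values, "values, about", size_mb, "MB\n")
```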

jwijffels commented 3 years ago

Yes, this is indeed the reason: I forgot to save the hashed buckets when saving with method ruimtehol. If you save with method binary, the buckets are still there, because the LHS and RHS embeddings are saved (https://github.com/bnosac/ruimtehol/blob/aeea341b0006fbba4101cefb5f43031a304dd10e/src/Starspace/src/model.cpp#L838), and everything works fine then.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, FUN = function(x) paste(x, collapse = " "))
set.seed(123456789)
idx <- sample(nrow(dekamer), size = round(nrow(dekamer) * 0.9))
traindata <- dekamer[idx, ]
testdata  <- dekamer[-idx, ]
set.seed(123456789)
model <- embed_tagspace(x = traindata$text,
                        y = traindata$question_theme_main,
                        early_stopping = 0.8,
                        dim = 10, minCount = 5,
                        loss = 'hinge',
                        negSearchLimit = 50,
                        ngrams = 3, bucket = 10
)

> ## no save
> txt <- "Cijfers verkrachting tussen partners In"
> predict(model, txt, type = "embedding")
       [,1]      [,2]     [,3]         [,4]       [,5]        [,6]      [,7]       [,8]       [,9]     [,10]
1 0.3062341 -0.129529 0.375498 0.0009290021 -0.2719682 -0.05267627 0.4873475 -0.4883637 0.07587066 0.4358196
> ## save & reload with type ruimtehol
> starspace_save_model(model, 'prova_dekamer', method = "ruimtehol")
> set.seed(123456789)
> prova_mod_load = starspace_load_model('prova_dekamer', method = "ruimtehol")
> predict(prova_mod_load, txt, type = "embedding")
       [,1]       [,2]      [,3]       [,4]       [,5]       [,6]      [,7]       [,8]      [,9]     [,10]
1 0.2382428 -0.3666292 0.3345218 -0.0092996 -0.3742568 -0.1522302 0.3492487 -0.4678786 0.1254112 0.4206861
> ## save & reload with type binary
> starspace_save_model(model, 'prova_dekamer', method = "binary")
Saving model to file : prova_dekamer
> prova_mod_load = ruimtehol:::textspace_load_model('prova_dekamer', is_tsv = FALSE)
Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.
> class(prova_mod_load) <- "textspace"
> predict(prova_mod_load, txt, type = "embedding")
       [,1]       [,2]     [,3]         [,4]       [,5]        [,6]      [,7]       [,8]       [,9]     [,10]
1 0.3062338 -0.1295296 0.375498 0.0009292171 -0.2719682 -0.05267647 0.4873469 -0.4883641 0.07587058 0.4358199
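
The three embeddings printed above can be compared numerically rather than by eye. The sketch below copies the values from the console output; max_abs_diff is a hypothetical helper introduced only for this illustration, not a ruimtehol function.

```r
# Embeddings copied from the console output above.
in_session <- c(0.3062341, -0.129529, 0.375498, 0.0009290021, -0.2719682,
                -0.05267627, 0.4873475, -0.4883637, 0.07587066, 0.4358196)
reload_ruimtehol <- c(0.2382428, -0.3666292, 0.3345218, -0.0092996, -0.3742568,
                      -0.1522302, 0.3492487, -0.4678786, 0.1254112, 0.4206861)
reload_binary <- c(0.3062338, -0.1295296, 0.375498, 0.0009292171, -0.2719682,
                   -0.05267647, 0.4873469, -0.4883641, 0.07587058, 0.4358199)

# Hypothetical helper: largest absolute element-wise difference.
max_abs_diff <- function(a, b) max(abs(a - b))

diff_ruimtehol <- max_abs_diff(in_session, reload_ruimtehol)  # large, about 0.24
diff_binary    <- max_abs_diff(in_session, reload_binary)     # below 1e-6
```

The reload with method ruimtehol differs in the first decimal place, whereas the binary reload agrees to roughly six decimals, which looks like ordinary rounding in the printed output.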

jwijffels commented 3 years ago

Just a note to myself AAAAAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRRGGGGGGGGHHHHHHHHHHHHHHHH

jwijffels commented 3 years ago

Hello @nmanca

I've fixed this in commit https://github.com/bnosac/ruimtehol/commit/96f02f4f5c2dea1d155770e88d114f3c3f694514, so that saving a model with ngrams > 1 using the default method now works as expected.
The embeddings of the buckets were not saved when using method = 'ruimtehol', which is the default; they were saved when using method = 'binary'. (Starspace uses the hashing trick to map the n-grams, of which there can be a huge number, onto a limited set of hash buckets.) Hence, when you reloaded, the model resorted to small random values for the embeddings of the n-grams and so relied mainly on the unigrams, unless you had saved your model with method = 'binary'.
The default bucket argument now results in an embedding matrix of 100000 x dim instead of 2000000 x dim.
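
As a toy illustration of the mechanism described above, the sketch below hashes bigrams into a small number of buckets and builds a document embedding from unigram rows plus bucket rows. Everything in it is an assumption made for illustration: hash_bucket is a made-up hash (not the one Starspace uses), the embeddings are random rather than trained, and summing the rows is just one simple way to combine them. It only shows why a reload that drops the bucket rows changes the result, while a reload that keeps them does not.

```r
set.seed(42)
dim_emb   <- 4
n_buckets <- 10
vocab     <- c("cijfers", "verkrachting", "tussen", "partners")

# Made-up hash for illustration: sum of character codes modulo n_buckets.
hash_bucket <- function(ngram, n_buckets) {
  sum(utf8ToInt(ngram)) %% n_buckets + 1
}

# Random stand-ins for trained unigram and bucket embeddings.
word_emb   <- matrix(rnorm(length(vocab) * dim_emb), nrow = length(vocab),
                     dimnames = list(vocab, NULL))
bucket_emb <- matrix(rnorm(n_buckets * dim_emb), nrow = n_buckets)

# Document embedding: unigram rows plus the bucket rows of its bigrams.
embed_doc <- function(tokens, word_emb, bucket_emb) {
  bigrams <- paste(head(tokens, -1), tail(tokens, -1))
  buckets <- vapply(bigrams, hash_bucket, numeric(1),
                    n_buckets = nrow(bucket_emb))
  colSums(word_emb[tokens, , drop = FALSE]) +
    colSums(bucket_emb[buckets, , drop = FALSE])
}

trained <- embed_doc(vocab, word_emb, bucket_emb)

# Save only the unigram rows: bucket rows come back as small random values.
f1 <- tempfile(fileext = ".rds")
saveRDS(list(word_emb = word_emb), f1)
m1 <- readRDS(f1)
bucket_reinit <- matrix(rnorm(n_buckets * dim_emb, sd = 0.001), nrow = n_buckets)
reloaded_without_buckets <- embed_doc(vocab, m1$word_emb, bucket_reinit)

# Save the bucket rows as well: the embedding is reproduced.
f2 <- tempfile(fileext = ".rds")
saveRDS(list(word_emb = word_emb, bucket_emb = bucket_emb), f2)
m2 <- readRDS(f2)
reloaded_with_buckets <- embed_doc(vocab, m2$word_emb, m2$bucket_emb)
```

Comparing trained with the two reloaded vectors (for example with all.equal) shows a match only when the bucket rows were part of the saved file.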

Please test it out.

jwijffels commented 3 years ago

The fix has been released on CRAN as well. Closing. Feel free to reopen if needed.

nmanca commented 3 years ago

Awesome, thank you very much @jwijffels

bnosac / ruimtehol

embed_tagspace produces different results within a session and when loaded (starspace_load_model) if ngrams is used #31