bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

embed_tagspace produces different results within a session and when loaded (starspace_load_model) if ngrams is used #31

Closed nmanca closed 3 years ago

nmanca commented 3 years ago

This strange behaviour showed up in my project, but I have also reproduced it with the dekamer example. Everything is fine until a model (embed_tagspace) is trained with the ngrams parameter specified, then saved and reloaded, no matter which save/load method I use. I noticed the inconsistencies in the predict results: the similarities are on a different scale and the ranking of the scored labels is different. Note that the method used to save and load the trained model does affect the predict results, but they are never consistent with those of the model object trained in the session.

jwijffels commented 3 years ago

can you provide a reproducible example?

nmanca commented 3 years ago

library(ruimtehol)

data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

set.seed(123456789) # seed before sampling so the train/test split is reproducible
idx <- sample(nrow(dekamer), size = round(nrow(dekamer) * 0.9))
traindata <- dekamer[idx, ]
testdata <- dekamer[-idx, ]
set.seed(123456789)
model <- embed_tagspace(x = traindata$text, 
                        y = traindata$question_theme_main, 
                        early_stopping = 0.8,
                        dim = 10, minCount = 5,
                        loss = 'hinge',
                        similarity = 'dot',
                        negSearchLimit = 50,
                        ngrams = 4 ###removing this line should produce consistent results
                        )

pred1 = predict(model, testdata[1,])

starspace_save_model(model, 'prova_dekamer')
prova_mod_load = starspace_load_model('prova_dekamer')

pred_1_load = predict(prova_mod_load, testdata[1,])

jwijffels commented 3 years ago

Thanks for the report! That is indeed unexpected. I think I just shot myself in the foot with this as well. Will look into it.

jwijffels commented 3 years ago

TODO

  1. ngrams is correctly loaded when doing starspace_load_model CHECKED: IS OK
  2. check here https://github.com/bnosac/ruimtehol/blob/c567706e06c62ea44c20256ab80e59913ed2821f/src/rcpp_textspace.cpp#L499 if ngrams is still there CHECKED: IS OK
  3. check here https://github.com/bnosac/ruimtehol/blob/c567706e06c62ea44c20256ab80e59913ed2821f/src/rcpp_textspace.cpp#L523 if ngrams is correctly used instead of the default 1
  4. or maybe the cause is the n-gram hashing buckets
jwijffels commented 3 years ago

Probably the buckets: https://github.com/bnosac/ruimtehol/blob/aeea341b0006fbba4101cefb5f43031a304dd10e/src/Starspace/src/model.cpp#L82. We never update the buckets when reloading the embeddings from file: https://github.com/bnosac/ruimtehol/blob/master/src/rcpp_textspace.cpp#L323

jwijffels commented 3 years ago

Looks like I forgot to save the embeddings of the 2000000 hashed buckets
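
To make the mechanism concrete, here is a minimal base R sketch of why missing bucket embeddings change the result. It is a toy model, not the StarSpace implementation: `toy_hash`, `ngram_rows` and `embed` are hypothetical helpers, the hash function is made up, and the sketch assumes that a document embedding is a plain sum of word rows plus hashed n-gram rows and that the bucket rows hold fresh values after a reload.

```r
# Toy string hash (NOT the hash StarSpace uses); doubles avoid integer overflow
toy_hash <- function(s) {
  h <- 0
  for (code in utf8ToInt(s)) h <- (h * 31 + code) %% 2147483647
  h
}

# Rows of the embedding matrix used by the n-grams (up to length n) of a document.
# Bucket rows are assumed to sit right after the word rows.
ngram_rows <- function(tokens, n, n_words, bucket) {
  rows <- numeric(0)
  for (i in seq_along(tokens)) {
    last <- min(length(tokens), i + n - 1)
    if (last > i) {
      for (j in (i + 1):last) {
        ngram <- paste(tokens[i:j], collapse = " ")
        rows <- c(rows, n_words + (toy_hash(ngram) %% bucket) + 1)
      }
    }
  }
  rows
}

# Simplified document embedding: sum of word rows and n-gram bucket rows
embed <- function(tokens, vocab, emb, n, bucket) {
  rows <- c(match(tokens, vocab), ngram_rows(tokens, n, length(vocab), bucket))
  colSums(emb[rows, , drop = FALSE])
}

set.seed(1)
vocab  <- c("cijfers", "verkrachting", "tussen", "partners")
bucket <- 10
n_dim  <- 3
emb    <- matrix(rnorm((length(vocab) + bucket) * n_dim), ncol = n_dim)
tokens <- vocab

in_session <- embed(tokens, vocab, emb, n = 3, bucket = bucket)

# Simulated save/reload that keeps the word rows but not the bucket rows
reloaded   <- emb
bucket_idx <- (length(vocab) + 1):nrow(emb)
reloaded[bucket_idx, ] <- rnorm(length(bucket_idx) * n_dim)
after_reload <- embed(tokens, vocab, reloaded, n = 3, bucket = bucket)

print(rbind(in_session, after_reload))
```

With n = 1 no bucket rows are touched in this sketch, which is in line with the observation that removing the ngrams argument gives consistent results.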

jwijffels commented 3 years ago

Yes, this is indeed the reason: I forgot to save the hashed buckets when saving with method ruimtehol. If you save with method binary, these buckets are still there, as the LHS and RHS embeddings are saved (https://github.com/bnosac/ruimtehol/blob/aeea341b0006fbba4101cefb5f43031a304dd10e/src/Starspace/src/model.cpp#L838), and everything works fine then.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, FUN = function(x) paste(x, collapse = " "))
set.seed(123456789)
idx <- sample(nrow(dekamer), size = round(nrow(dekamer) * 0.9))
traindata <- dekamer[idx, ]
testdata  <- dekamer[-idx, ]
set.seed(123456789)
model <- embed_tagspace(x = traindata$text,
                        y = traindata$question_theme_main,
                        early_stopping = 0.8,
                        dim = 10, minCount = 5,
                        loss = 'hinge',
                        negSearchLimit = 50,
                        ngrams = 3, bucket = 10
)
> ## no save
> txt <- "Cijfers verkrachting tussen partners In"
> predict(model, txt, type = "embedding")
       [,1]      [,2]     [,3]         [,4]       [,5]        [,6]      [,7]       [,8]       [,9]     [,10]
1 0.3062341 -0.129529 0.375498 0.0009290021 -0.2719682 -0.05267627 0.4873475 -0.4883637 0.07587066 0.4358196
> ## save & reload with type ruimtehol
> starspace_save_model(model, 'prova_dekamer', method = "ruimtehol")
> set.seed(123456789)
> prova_mod_load = starspace_load_model('prova_dekamer', method = "ruimtehol")
> predict(prova_mod_load, txt, type = "embedding")
       [,1]       [,2]      [,3]       [,4]       [,5]       [,6]      [,7]       [,8]      [,9]     [,10]
1 0.2382428 -0.3666292 0.3345218 -0.0092996 -0.3742568 -0.1522302 0.3492487 -0.4678786 0.1254112 0.4206861
> ## save & reload with type binary
> starspace_save_model(model, 'prova_dekamer', method = "binary")
Saving model to file : prova_dekamer
> prova_mod_load = ruimtehol:::textspace_load_model('prova_dekamer', is_tsv = FALSE)
Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.
> class(prova_mod_load) <- "textspace"
> predict(prova_mod_load, txt, type = "embedding")
       [,1]       [,2]     [,3]         [,4]       [,5]        [,6]      [,7]       [,8]       [,9]     [,10]
1 0.3062338 -0.1295296 0.375498 0.0009292171 -0.2719682 -0.05267647 0.4873469 -0.4883641 0.07587058 0.4358199
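
The three embeddings printed above can also be compared numerically. A small base R check, using only the values copied from the console output; the variable names, the helper `max_abs_diff` and the 1e-5 tolerance are illustrative choices, not part of the package:

```r
# Embeddings copied from the console output above
in_session <- c(0.3062341, -0.129529, 0.375498, 0.0009290021, -0.2719682,
                -0.05267627, 0.4873475, -0.4883637, 0.07587066, 0.4358196)
reload_ruimtehol <- c(0.2382428, -0.3666292, 0.3345218, -0.0092996, -0.3742568,
                      -0.1522302, 0.3492487, -0.4678786, 0.1254112, 0.4206861)
reload_binary <- c(0.3062338, -0.1295296, 0.375498, 0.0009292171, -0.2719682,
                   -0.05267647, 0.4873469, -0.4883641, 0.07587058, 0.4358199)

# Largest element-wise deviation between two embeddings
max_abs_diff <- function(a, b) max(abs(a - b))

tol <- 1e-5  # arbitrary tolerance for float round-trip noise
same_binary    <- max_abs_diff(in_session, reload_binary) < tol
same_ruimtehol <- max_abs_diff(in_session, reload_ruimtehol) < tol
print(c(binary = same_binary, ruimtehol = same_ruimtehol))
```

The binary round trip agrees with the in-session model up to floating-point noise, while the ruimtehol round trip does not.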

jwijffels commented 3 years ago

Just a note to myself AAAAAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRRGGGGGGGGHHHHHHHHHHHHHHHH

jwijffels commented 3 years ago

Hello @nmanca

Please test it out.

jwijffels commented 3 years ago

The fix has been released on CRAN as well. Closing. Feel free to reopen if needed.

nmanca commented 3 years ago

Awesome, thank you very much @jwijffels