bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

Running ruimtehol on R server #29

Closed rdatasculptor closed 4 years ago

rdatasculptor commented 4 years ago

When I try to load a trained ruimtehol model on a remote server (where I run R), I get this error message:

Error in (function (model = "textspace.bin", save = FALSE, trainFile = "", : std::bad_alloc

Any ideas about what I could be doing wrong? Thanks!

jwijffels commented 4 years ago

How did you save the model & what is the dimension of the embedding matrix?

rdatasculptor commented 4 years ago

I saved it as a .ruimtehol file; the number of dimensions is 160.

jwijffels commented 4 years ago

So you built the model on Windows and are now loading it on Linux? The .ruimtehol files are basically .rds files, which store the embedding matrix together with the labels, if there are any. The only things I can think of are a mismatch in encoding, a really large number of rows (words) in your embedding matrix, or words in it which are merely "" or similar. Is it possible to share your model?

rdatasculptor commented 4 years ago

It is very large indeed. I used a pretrained model as a starting point. The model is almost 1 GB. But then again, it is strange that it works on Windows and not on Linux. I am not completely sure I can share it, because it contains company info as well. Maybe I can make a new model that is almost as big as my current model, but without the extra company information.

rdatasculptor commented 4 years ago

Note: I used the pretrained "combined" model with 160 dimensions, which can be found here: https://github.com/clips/dutchembeddings

jwijffels commented 4 years ago

  1. What was the dimension of the model you got from https://github.com/clips/dutchembeddings (how many rows did you take)?
  2. In which setting did you build your model? Was it a transfer learning setting where you passed on the embedding matrix you got from https://github.com/clips/dutchembeddings? In that case, did you also provide the embeddings of the labels, if you had any? What was your training script?

rdatasculptor commented 4 years ago

This is how I read in the pretrained file:

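library(readr)
# skip = 1 drops the first line of the file; columns are named X1, X2, ...
pretrained <- read_delim("combined-160.txt",delim=" ",skip=1,col_names = FALSE)
# The first column holds the words, the remaining 160 columns the embedding
namen <- pretrained$X1
pretrained <- as.data.frame(pretrained[c(-1)])
dimnames(pretrained) <- list(namen, 1:160)
# embed_articlespace receives the embeddings as a matrix with words as rownames
pretrained <- as.matrix(pretrained)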

After that I trained the model like this:

model <- embed_articlespace(knowledgebase, embeddings=pretrained, 
                            dim = 160, lr = 0.05, epoch = 40,
                            similarity = "cosine", negSearchLimit = 50, loss="softmax",
                            maxNegSamples = 3, dropoutRHS = 0.5, adagrad = TRUE, minCount = 10, ngrams=2) 

The pretrained file has 1442950 rows.

jwijffels commented 4 years ago

Ok, the training code looks fine. No idea about the bad alloc; it is clearly coming from the StarSpace C++ code, probably due to the number of terms in the model: 1442950. What I suggest you do is limit that. Just get the words which you have in your knowledgebase and reduce the number of terms in your embedding matrix, as in pretrained <- pretrained[limitedsetofwords, ]. Or look at the encoding of the words (maybe your Linux and Windows machines have a different default encoding, latin1 versus UTF-8 for example): table(Encoding(rownames(pretrained)))
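The row-limiting suggestion above can be sketched as follows. This is a minimal illustration on toy data, not the code used in this thread: the small `pretrained` matrix and the `vocabulary` vector are hypothetical stand-ins for the real 1442950-row matrix and for the words that occur in the knowledgebase.

```r
# Toy stand-in for the pretrained matrix: rows are words, columns are dimensions
pretrained <- matrix(rnorm(5 * 4), nrow = 5,
                     dimnames = list(c("de", "het", "fiets", "zzzz", ""), NULL))

# Hypothetical vocabulary extracted from the training text
vocabulary <- c("de", "fiets", "auto")

# Keep only rows whose word is non-empty and occurs in the vocabulary
keep <- nzchar(rownames(pretrained)) & rownames(pretrained) %in% vocabulary
pretrained_small <- pretrained[keep, , drop = FALSE]

dim(pretrained_small)
# Check how the remaining terms are marked
table(Encoding(rownames(pretrained_small)))
```

Using `drop = FALSE` keeps the result a matrix even if only one word survives the filter.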

rdatasculptor commented 4 years ago

Thank you for your suggestions! Because I need the words in the pretrained file as well, limiting the number of words is not my preferred way to go. Right now I am trying to train the model on the remote server instead of on my Windows machine. Perhaps that could help too?

jwijffels commented 4 years ago

That would be the first thing I would try as well. Were all the terms of your pretrained embedding matrix in UTF-8? table(Encoding(rownames(pretrained)))

rdatasculptor commented 4 years ago

They were in "unknown"... Thinking about it, maybe that will be a problem on the remote server as well.

jwijffels commented 4 years ago

Not 100% sure, but can you make sure that in read_delim you set the correct encoding of your terms, and convert them to UTF-8 if needed?
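A minimal base R sketch of such a conversion is shown below. It assumes the source encoding is latin1, which is only a guess for illustration; the real file may use something else. With readr, the file encoding can be declared at read time through the `locale` argument, for example `locale = locale(encoding = "latin1")`. Note also that `Encoding()` reports "unknown" for plain ASCII strings, so "unknown" on its own does not prove the terms are broken.

```r
# Build a latin1 string from raw bytes ("caf" followed by byte 0xE9) plus an ASCII word
words <- c(rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))), "huis")
Encoding(words) <- "latin1"   # declare the encoding; ASCII strings stay "unknown"
Encoding(words)

# Convert to UTF-8 (the latin1 source encoding is an assumption)
words_utf8 <- iconv(words, from = "latin1", to = "UTF-8")
Encoding(words_utf8)

# Check that every term is now valid UTF-8
all(validUTF8(words_utf8))
```

The same call could be applied to `rownames(pretrained)` before training, once the actual source encoding is known.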

jwijffels commented 4 years ago

Let me know if that solved the issue, as I'm frequently in the same setting of building models on Windows and deploying them on Linux.

jwijffels commented 4 years ago

Which code is this error coming from?

rdatasculptor commented 4 years ago

Sorry, that error was caused by something silly I did. I have deleted it again.

But this one is the real error I get:

Error in (function (model = "textspace.bin", save = FALSE, trainFile = "",  :
  std::bad_alloc

So, training the model on the remote server gives the same error as loading the Windows-trained model on the remote server.

jwijffels commented 4 years ago

Maybe a lack of RAM to hold the 1.4 million x 160 matrix? The error should be traced with the gdb debugger to see where it is coming from.
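As a rough back-of-the-envelope check of that hypothesis, the size of the matrix as an R numeric (double precision, 8 bytes per value) matrix can be computed directly. This only estimates the R-side copy of the embeddings; how much memory the StarSpace C++ code allocates on top of that is not covered here.

```r
# Dimensions reported earlier in this thread
n_rows <- 1442950
n_cols <- 160

# An R numeric matrix stores each value as an 8-byte double
bytes <- n_rows * n_cols * 8
bytes            # 1846976000 bytes
bytes / 1024^3   # roughly 1.72 GiB
```

Any intermediate copy made while loading or training adds to this figure, so a server with only a few GB of RAM available to R can plausibly run out.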

rdatasculptor commented 4 years ago

I think you are right. All goes well now... I will let you know if increasing the RAM indeed solved the problem.

rdatasculptor commented 4 years ago

Okay, that solved it. Thank you for pointing me in the right direction. I was not aware of the lack of RAM in my R environment on the remote server. Increasing the RAM solved my problems.