Closed: rdatasculptor closed this issue 4 years ago.
When I try to load a trained ruimtehol model on a remote server (where I run R), this error message appears:
Error in (function (model = "textspace.bin", save = FALSE, trainFile = "", : std::bad_alloc
Any ideas about what I could be doing wrong? Thanks!
How did you save the model & what is the dimension of the embedding matrix?
I saved it as a .ruimtehol file; the number of dimensions is 160.
So you built the model on Windows and are now loading it on Linux? The .ruimtehol files are basically .rds files, which save the embedding matrix together with the labels, if there are any. The only things I can think of are a mismatch in encoding, an embedding matrix with a really, really large number of rows (words), or a matrix containing words which are merely "" or similar. Is it possible to share your model?
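For reference, saving and loading in this format looks roughly like this (a minimal sketch; the file name is just an example and I'm assuming the default "ruimtehol" save method):
library(ruimtehol)
## Save the model plus its embedding matrix and labels to a .ruimtehol file
starspace_save_model(model, file = "articlespace.ruimtehol")
## Load it back, e.g. on the Linux machine
model <- starspace_load_model("articlespace.ruimtehol")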
It is very large indeed. I used a pretrained model as a starting point; the model is almost 1 GB. But then again, it is strange that it works on Windows and not on Linux. I am not completely sure I can share it, because it contains company info as well. Maybe I can build a new model that is almost as big as my current one, but without the extra company information.
Note: I used the pretrained "combined" model with 160 dimensions, to be found here: https://github.com/clips/dutchembeddings
this is how I read the pretrained file in:
library(readr)
## Read the pretrained embeddings; the first column holds the terms
pretrained <- read_delim("combined-160.txt", delim = " ", skip = 1, col_names = FALSE)
namen <- pretrained$X1
## Drop the term column and use the terms as row names
pretrained <- as.data.frame(pretrained[, -1])
dimnames(pretrained) <- list(namen, 1:160)
pretrained <- as.matrix(pretrained)
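A quick sanity check on the resulting matrix (the dimensions below are the ones mentioned later in this thread):
dim(pretrained)             # expected: 1442950 rows x 160 columns
head(rownames(pretrained))  # a peek at the first few terms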
After that I trained the model like this:
model <- embed_articlespace(knowledgebase, embeddings = pretrained,
                            dim = 160, lr = 0.05, epoch = 40,
                            similarity = "cosine", negSearchLimit = 50, loss = "softmax",
                            maxNegSamples = 3, dropoutRHS = 0.5, adagrad = TRUE,
                            minCount = 10, ngrams = 2)
The pretrained file has 1442950 rows.
Ok, the training code looks fine. No idea about the bad alloc; it's clearly coming from the Starspace C++ code, probably due to the number of terms in the model: 1442950. What I suggest is that you limit that: just take the words which occur in your knowledgebase and reduce the number of terms in your embedding matrix, as in pretrained <- pretrained[limitedsetofwords, ] (see the sketch below).
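A minimal sketch of that reduction, assuming knowledgebase has a text column holding the raw documents (the column name is hypothetical):
## Build the vocabulary actually present in the training data
vocab <- unique(unlist(strsplit(tolower(knowledgebase$text), "\\W+")))
## Keep only the pretrained terms that occur in that vocabulary
limitedsetofwords <- intersect(rownames(pretrained), vocab)
pretrained <- pretrained[limitedsetofwords, ]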
Or look at the encoding of the words (maybe your Linux and Windows machines have different default encodings, latin1 vs. UTF-8):
table(Encoding(rownames(pretrained)))
Thank you for your suggestions! Because I need the words in the pretrained file as well, limiting the number of words is not my preferred way to go. Right now I am trying to train the model on the remote server instead of on my Windows machine. Perhaps that could help too?
That would be my first thing to try out as well.
Were all the terms of your pretrained embedding matrix in UTF-8? table(Encoding(rownames(pretrained)))
They were "unknown"... come to think of it, maybe that will be a problem on the remote server as well.
Not 100% sure, but can you make sure you set the correct encoding in read_delim and convert the terms, if needed, to UTF-8?
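A sketch of both steps; the "latin1" value is just an assumption, adjust it to whatever encoding the file actually uses:
## Tell readr which encoding the source file uses
pretrained <- read_delim("combined-160.txt", delim = " ", skip = 1,
                         col_names = FALSE,
                         locale = locale(encoding = "latin1"))
## Force the term names to UTF-8 after building the matrix
rownames(pretrained) <- enc2utf8(rownames(pretrained))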
Let me know if that solves the issue, as I'm frequently in the same setting, building models on Windows and deploying on Linux.
Which code is this error coming from?
Sorry, that was an error caused by something silly I did; I deleted it again.
But this one is the real error I get:
Error in (function (model = "textspace.bin", save = FALSE, trainFile = "", :
std::bad_alloc
So, training the model on the remote server gives the same error as loading the Windows-trained model on the remote server.
Maybe a lack of RAM to put the 1.4M x 160 matrix in? The error should be traced with the gdb debugger to see where it is coming from.
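A back-of-the-envelope check of that hypothesis: R stores the matrix as doubles (8 bytes each), so a single copy is already sizeable, and R easily makes temporary copies during training:
rows <- 1442950
dims <- 160
rows * dims * 8 / 1024^3   # ~1.7 GiB for one copy of the dense matrix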
I think you are right. All goes well now... I will let you know whether increasing the RAM indeed solved the problem.
Okay, that solved it. Thank you for pointing me in the right direction. I was not aware of the lack of RAM in my R environment on the remote server; increasing the RAM solved my problems.