bnosac / doc2vec

Distributed Representations of Sentences and Documents

predict.paragraph2vec crashes with words greater than 103 chars long #20

Closed Ingolifs closed 3 years ago

Ingolifs commented 3 years ago

It took me a little while to hunt down the cause of this crash...

It does this on my machine at the very least. This is on R 3.6.3.


library(doc2vec)

corpus <- data.frame(doc_id = 1, text = "here are some words for training the model")
model <- paragraph2vec(x = corpus, type = "PV-DM", dim = 10, iter = 20, min_count = 1)

# this text will successfully run
successtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(successtext)
predict(model, newdata = list(a=successtext), type = "embedding", which = "docs")

# this text will cause a crash
failtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(failtext)
predict(model, newdata = list(a=failtext), type = "embedding", which = "docs")

jwijffels commented 3 years ago

Thanks for the reproducible boom.

Unfortunately, there is currently a hard limit built into the C++ side: a word can be at most 100 characters long. This can be seen here: https://github.com/bnosac/doc2vec/blob/master/src/doc2vec/common_define.h#L11. Another hard limit on the C++ side is that a document can contain at most 1000 words (https://github.com/bnosac/doc2vec/blob/master/src/doc2vec/common_define.h#L14).

The boom probably occurs at https://github.com/bnosac/doc2vec/blob/3e947562a0a69e11eb292283116a4fdc9cf5c0f4/src/doc2vec/TaggedBrownCorpus.cpp#L88 or https://github.com/bnosac/doc2vec/blob/3e947562a0a69e11eb292283116a4fdc9cf5c0f4/src/doc2vec/TaggedBrownCorpus.cpp#L90
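
To find input that runs into the two limits mentioned above before calling predict, a check along the following lines may help. This is only a sketch in base R: `check_limits` is a hypothetical helper, not part of doc2vec, the values 100 and 1000 are taken from this thread rather than read from the package, and splitting on whitespace is an assumption about how words are delimited.

```r
# Limits as quoted in this thread (assumed values, not queried from the package).
max_word_chars <- 100L
max_doc_words  <- 1000L

# Hypothetical helper: for each text, report the longest word and the word count,
# and flag texts that exceed either limit. Words are split on whitespace.
check_limits <- function(text, max_chars = max_word_chars, max_words = max_doc_words) {
  tokens <- strsplit(trimws(text), "[[:space:]]+")
  longest <- vapply(tokens, function(w) max(c(0L, nchar(w))), integer(1))
  n_words <- lengths(tokens)
  data.frame(
    longest_word   = longest,
    n_words        = n_words,
    word_too_long  = longest > max_chars,
    too_many_words = n_words > max_words
  )
}

texts  <- c("short words only", paste(rep("G", 104), collapse = ""))
limits <- check_limits(texts)
print(limits)
```

Rows where `word_too_long` or `too_many_words` is TRUE point at texts worth shortening or splitting before they reach the C++ code.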

jwijffels commented 3 years ago

The package only allows words with a maximum length of 100 characters. The crash is caused by strcpy copying a longer string into a memory block which only has room for 100 characters, namely at https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L147 and https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L254. I'll explicitly truncate the strings to 100 characters to avoid the crash on the C++ side, but keep in mind that your words should be shorter than 100 characters.

jwijffels commented 3 years ago

Fixed using substr in commit https://github.com/bnosac/doc2vec/commit/7f584f9f0458691595c511480bcb940ebbc1fd93
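
For anyone on a package version without that commit, the same idea can be applied on the user side before calling predict. The sketch below does not reproduce the commit: `truncate_words` is a hypothetical helper, the default of 100 characters comes from this thread, and words are assumed to be separated by whitespace.

```r
# Hypothetical helper: cut every whitespace-separated word to at most max_chars
# characters with substr, then join the words back together with single spaces.
truncate_words <- function(text, max_chars = 100L) {
  vapply(strsplit(text, "[[:space:]]+"), function(words) {
    paste(substr(words, 1L, max_chars), collapse = " ")
  }, character(1))
}

failtext <- paste(rep("G", 104), collapse = "")
safetext <- truncate_words(paste("here are", failtext))

# Longest word after truncation
max(nchar(strsplit(safetext, " ")[[1]]))
```

The result could then be passed as `newdata = list(a = safetext)`. Passing a smaller `max_chars` is a more conservative choice if it is unclear whether the limit includes the string terminator.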