JonasRieger / ldaPrototype

Determine a Prototype from a number of runs of Latent Dirichlet Allocation.
GNU General Public License v3.0

docs object expects all word frequencies to be 1 - transformation from dfm object (quanteda) #10

Open JonasRieger opened 3 years ago

JonasRieger commented 3 years ago

The docs object expects (for technical reasons) that all words occur with frequency 1. If words occur several times, they appear several times each with frequency 1. In the quanteda package there are dfm objects that also allow values greater than 1. If you do your preprocessing in quanteda and want to use quanteda::dfm2lda to convert your object into the necessary structure, you need one more step to fulfill the requirements for the docs object. Just execute the following line:

docs = lapply(docs, function(x) rbind(rep(x[1,], x[2,]), 1))

This replicates words with multiple occurrences and protects you from the error message all(sapply(docs, function(x) all(x[2, ] == 1))) is not TRUE in LDARep and similar functions.
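
To make the replication step concrete, here is a minimal sketch on a hand-made document matrix. The toy values and the names x and expanded are assumptions for illustration only; the layout (word index in row 1, frequency in row 2) is the one the line above relies on.

```r
# Hypothetical toy document: row 1 = word index, row 2 = frequency.
# Word 0 occurs twice, word 5 occurs three times.
x <- rbind(c(0L, 5L), c(2L, 3L))

# Repeat each word index as often as its frequency, then set every frequency to 1.
expanded <- rbind(rep(x[1, ], x[2, ]), 1)

expanded[1, ]  # word indices: 0 0 5 5 5
expanded[2, ]  # frequencies:  1 1 1 1 1
```

The number of columns after the step equals the sum of the original frequencies (here 2 + 3 = 5), so the total token count of the document is unchanged.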

abitter commented 3 years ago

Unfortunately, this yields a numeric matrix (at least in R 4.1.1), whereas LDARep expects an integer matrix. There might be a more elegant solution, but this did the trick for me:

docs <- lapply(docs, function(x) rbind(rep(as.integer(x[1,]), as.integer(x[2,])), as.integer(1)))

JonasRieger commented 3 years ago

Yeah, you're right.

docs = convert(dfmat, "lda")$documents
docs = lapply(docs, function(x) rbind(rep(x[1,], x[2,]), 1L))

should do it as well.
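
A quick way to check both requirements discussed in this thread (all frequencies equal to 1, and integer storage) is sketched below. It uses a hand-made list in place of real quanteda output, so the toy values and the names docs_raw, docs_dbl, and docs_int are assumptions; only base R functions are called.

```r
# Hypothetical stand-in for the $documents element of a converted dfm:
# each element has word indices in row 1 and frequencies in row 2.
docs_raw <- list(
  rbind(c(0L, 5L), c(2L, 3L)),
  rbind(c(1L, 2L, 4L), c(1L, 1L, 2L))
)

# Variant with a double constant (1): the result is stored as double.
docs_dbl <- lapply(docs_raw, function(x) rbind(rep(x[1, ], x[2, ]), 1))

# Variant with an integer constant (1L): the result stays integer.
docs_int <- lapply(docs_raw, function(x) rbind(rep(x[1, ], x[2, ]), 1L))

# The frequency condition quoted in the error message holds for both variants.
all(sapply(docs_int, function(x) all(x[2, ] == 1)))  # TRUE
all(sapply(docs_dbl, function(x) all(x[2, ] == 1)))  # TRUE

# Only the 1L variant yields integer matrices.
all(sapply(docs_int, is.integer))  # TRUE
any(sapply(docs_dbl, is.integer))  # FALSE
```

The difference comes from R's type coercion in rbind: combining an integer vector with the double constant 1 promotes the whole matrix to double, whereas 1L keeps it integer.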