Closed jiunsiew closed 6 years ago
Just to give a bit more context to the question above, here's some code that illustrates the problem. The object lda_model is the LDA object after fit_transform.
I'm basically trying to calculate term.topic.frequency but with the theta value that comes from the new data, not the one from the model fit. Thanks.
## these arguments passed into the createJSON function
## see https://github.com/dselivanov/text2vec/blob/master/R/model_LDA.R
phi <- lda_model$.__enclos_env__$private$topic_word_distribution_with_prior()
theta <- lda_model$.__enclos_env__$private$doc_topic_distribution_with_prior() ## this should be the new data
doc.length <- lda_model$.__enclos_env__$private$doc_len
vocab <- lda_model$.__enclos_env__$private$vocabulary
term.frequency <- colSums(lda_model$components)
## now calculate the ldavis metrics
## see https://github.com/cpsievert/LDAvis/blob/master/R/createJSON.R
# Set the values of a few summary statistics of the corpus and model:
dp <- dim(phi) # should be K x W
dt <- dim(theta) # should be D x K
N <- sum(doc.length) # number of tokens in the data
W <- length(vocab) # number of terms in the vocab
D <- length(doc.length) # number of documents in the data
K <- dt[2] # number of topics in the model
# compute counts of tokens across K topics (length-K vector):
# (this determines the areas of the default topic circles when no term is
# highlighted)
topic.frequency <- colSums(theta * doc.length) ## document topic distribution * number of tokens in document
topic.proportion <- topic.frequency/sum(topic.frequency)
# token counts for each term-topic combination (widths of red bars)
term.topic.frequency <- phi * topic.frequency ## topic word distribution * topic frequency
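The calculation above can be checked on a small example using only base R. This is a minimal sketch, not text2vec or LDAvis code: the numbers are made up, and theta_new and doc_length_new are hypothetical names standing in for the doc-topic distribution and document lengths of the new data (for instance, whatever the transform method returns for the new documents).

```r
# Toy inputs (made up for illustration): K = 2 topics, W = 3 terms, D = 2 new documents.
phi <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.2, 0.7),
              nrow = 2, byrow = TRUE)          # K x W, each row sums to 1
theta_new <- matrix(c(0.8, 0.2,
                      0.4, 0.6),
                    nrow = 2, byrow = TRUE)    # D x K, each row sums to 1
doc_length_new <- c(10, 20)                    # tokens per new document (hypothetical)

# Expected token count per topic in the new data (length-K vector).
# R recycles doc_length_new down the columns, so row d is scaled by doc_length_new[d].
topic_frequency_new <- colSums(theta_new * doc_length_new)

# Expected token count per (topic, term) pair.
# Recycling again runs down the columns, so row k of phi is scaled by topic_frequency_new[k].
term_topic_frequency_new <- phi * topic_frequency_new

# The same product written explicitly as a matrix multiplication.
stopifnot(isTRUE(all.equal(term_topic_frequency_new,
                           diag(topic_frequency_new) %*% phi)))

# Term frequencies implied by the model for the new data (length-W vector).
term_frequency_new <- colSums(term_topic_frequency_new)
```

Because every row of phi and theta_new sums to 1, the entries of term_topic_frequency_new add up to the total number of tokens in the new documents, which is a quick sanity check on the shapes.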
Hi @jiunsiew. We use doc_topic_distribution_with_prior, not the prior itself. This means that the doc-topic pseudo-count matrix theta is smoothed with the doc-topic priors. You can check the formulas here, for example: https://stlong0521.github.io/20160326%20-%20LDA.html.
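As an illustration of what "smoothed with the doc-topic priors" usually means in LDA, here is a minimal base R sketch of the standard estimate: add the prior as pseudo-counts to the doc-topic counts, then normalize each row. The counts and the value of alpha are made up, and this is not taken from the text2vec source, so treat it as an assumption about the general technique rather than the package's exact computation.

```r
# Toy doc-topic counts (made up): D = 2 documents, K = 3 topics.
doc_topic_counts <- matrix(c(4, 0, 1,
                             0, 2, 2),
                           nrow = 2, byrow = TRUE)
alpha <- 0.1                 # symmetric doc-topic prior, a fixed hyper-parameter (assumed value)
K <- ncol(doc_topic_counts)

# Add the prior as pseudo-counts, then normalize each row so it sums to 1.
smoothed <- doc_topic_counts + alpha
theta <- smoothed / rowSums(smoothed)

# Each row total grows by K * alpha; the prior itself never changes with the data.
stopifnot(isTRUE(all.equal(rowSums(smoothed),
                           rowSums(doc_topic_counts) + K * alpha)))
```

Note that a topic with zero observed tokens in a document still gets a small non-zero probability, which is the practical effect of the smoothing.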
If I apply the model to new data with the transform method, it appears that the priors are not updated. Is there a way to get the priors with the new data?
I didn't quite get the question. Priors are hyper-parameters of the model, and they are fixed once the model is initialized.
I hope I've answered your question.
Hi there,
I have been using the latest version of the package (0.5.1.2), which I am finding really good. Thanks for your efforts with the development of this package!
I noticed that in the plot method, the phi and theta inputs to LDAvis::createJSON use the priors instead of the posteriors, and I have two questions regarding this:
1. [...] doc_topic_distribution as opposed to the prior which is obtained from the doc_topic_distribution_prior() method)?
2. If I apply the model to new data with the transform method, it appears that the priors are not updated. Is there a way to get the priors with the new data?
To give some context, I'm basically trying to calculate the term frequencies after applying the model to new data in the same way that LDAvis does.
Thanks again.