dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Question: Is it possible to get updated doc_topic_distribution_with_prior with new data so it can be used with createJSON? #261

Closed: jiunsiew closed this issue 6 years ago

jiunsiew commented 6 years ago

Hi there,

I have been using the latest version of the package (0.5.1.2) and am finding it really good. Thanks for your efforts developing it!

I noticed that in the plot method, the phi and theta inputs to LDAvis::createJSON use the priors instead of the posteriors, and I have two questions regarding this:

  1. Why is the prior used instead of the posterior? (By posterior here, I mean doc_topic_distribution, as opposed to the prior-smoothed matrix obtained from the doc_topic_distribution_with_prior() method.)
  2. If I apply the model to new data with the transform method, it appears that the priors are not updated. Is there a way to get the prior-smoothed distribution for the new data?

To give some context, I'm basically trying to calculate the term frequencies after applying the model to new data in the same way that LDAvis does.
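Roughly, this is the workflow I have in mind (just a sketch: dtm and new_dtm stand in for document-term matrices built with the same vocabulary, and the parameter values are only for illustration):

```r
library(text2vec)

## priors are fixed here, when the model is created
lda_model <- LDA$new(n_topics = 20, doc_topic_prior = 0.1, topic_word_prior = 0.01)

## in-sample doc-topic proportions; plot() builds the LDAvis JSON from the
## corresponding "with prior" matrices of the training data
doc_topic_train <- lda_model$fit_transform(dtm, n_iter = 1000)

## doc-topic proportions for new documents; I'd like the "with prior"
## version of this as well, so it can go into LDAvis::createJSON
doc_topic_new <- lda_model$transform(new_dtm)

## interactive LDAvis output, currently built from the training data only
lda_model$plot()
```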

Thanks again.

jiunsiew commented 6 years ago

Just to give a bit more context to the question above, here's some code that illustrates the problem. The object lda_model is the LDA object after fit_transform.

I'm basically trying to calculate term.topic.frequency, but with the theta that comes from the new data rather than from the model fit. Thanks.

## these arguments are passed into the createJSON function
## see https://github.com/dselivanov/text2vec/blob/master/R/model_LDA.R
phi            <- lda_model$.__enclos_env__$private$topic_word_distribution_with_prior()
theta          <- lda_model$.__enclos_env__$private$doc_topic_distribution_with_prior()   ## this should be the new data
doc.length     <- lda_model$.__enclos_env__$private$doc_len
vocab          <- lda_model$.__enclos_env__$private$vocabulary
term.frequency <- colSums(lda_model$components)

## now calculate the ldavis metrics
## see https://github.com/cpsievert/LDAvis/blob/master/R/createJSON.R
# Set the values of a few summary statistics of the corpus and model:
dp <- dim(phi)  # should be K x W
dt <- dim(theta)  # should be D x K

N <- sum(doc.length)  # number of tokens in the data
W <- length(vocab)  # number of terms in the vocab
D <- length(doc.length)  # number of documents in the data
K <- dt[2]  # number of topics in the model

# compute counts of tokens across K topics (length-K vector):
# (this determines the areas of the default topic circles when no term is 
# highlighted)
topic.frequency <- colSums(theta * doc.length)            ## document topic distribution * number of tokens in document
topic.proportion <- topic.frequency/sum(topic.frequency)

# token counts for each term-topic combination (widths of red bars)
term.topic.frequency <- phi * topic.frequency  ## topic word distribution * topic frequency
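For the new data, I imagine something along these lines. This is only a sketch: new_dtm is a placeholder document-term matrix, alpha is a stand-in for the doc_topic_prior used when the model was created, and the manual smoothing step is my assumption about how the "with prior" matrix could be reproduced outside the package:

```r
## doc-topic proportions for the new documents (rows sum to 1)
theta_new      <- lda_model$transform(new_dtm)
doc.length.new <- Matrix::rowSums(new_dtm)        # tokens per new document

## assumed reconstruction of the prior-smoothed theta: turn the proportions
## back into pseudo counts, add the doc-topic prior, renormalise each row
alpha     <- 0.1                                  # placeholder for doc_topic_prior
counts    <- theta_new * doc.length.new           # approximate doc-topic counts
theta_new <- (counts + alpha) / rowSums(counts + alpha)

## same LDAvis quantities as above, but driven by the new data
topic.frequency.new      <- colSums(theta_new * doc.length.new)
topic.proportion.new     <- topic.frequency.new / sum(topic.frequency.new)
term.topic.frequency.new <- phi * topic.frequency.new
```

The smoothing step is the part I'm unsure about, which is really the heart of the question.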
dselivanov commented 6 years ago

Hi @jiunsiew. We use doc_topic_distribution_with_prior, not the prior itself. This means that the doc-topic pseudo-count matrix theta is smoothed with the doc-topic prior. You can check the formulas here, for example: https://stlong0521.github.io/20160326%20-%20LDA.html.
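Schematically, the smoothing is just this (a toy illustration with made-up counts, not the exact package internals):

```r
## toy doc-topic pseudo-count matrix: 2 documents x 3 topics
n <- matrix(c(5, 0, 3,
              2, 7, 1), nrow = 2, byrow = TRUE)
doc_topic_prior <- 0.1

## add the prior to every count, then normalise each row to sum to 1
theta_with_prior <- (n + doc_topic_prior) / rowSums(n + doc_topic_prior)
```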

> If I apply the model to new data with the transform method, it appears that the priors are not updated. Is there a way to get the prior-smoothed distribution for the new data?

I didn't quite get the question. The priors are hyper-parameters of the model, and they are fixed once the model is initialized.
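In other words (a minimal sketch with arbitrary values, where dtm and new_dtm are placeholders as above): the priors are passed once to the constructor, and transform simply reuses them, so there is nothing to update.

```r
## both priors are set here and never change afterwards
lda_model <- LDA$new(n_topics = 20, doc_topic_prior = 0.1, topic_word_prior = 0.01)

lda_model$fit_transform(dtm)   # uses doc_topic_prior = 0.1, topic_word_prior = 0.01
lda_model$transform(new_dtm)   # same priors, unchanged
```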

dselivanov commented 6 years ago

I hope I've answered your question.