TommyJones / textmineR

An aid for text mining in R, with a syntax that should be familiar to experienced R users. Provides a wrapper for several topic models that take similarly formatted input and give similarly formatted output. Has additional functionality for analyzing and diagnosing topic models.

Discrepancy in topic content when summarizing and visualizing with LDAvis #72

Closed leungi closed 5 years ago

leungi commented 5 years ago

Apologies for not providing a reprex (due to size), but the code below uses an example from the textmineR package, so it should be reproducible.

Issue: when reviewing model$summary for, say, topic 1 (t_1), the topic content doesn't match the topic marked as 1 in the LDAvis plot.

I believe the definitions of phi P(token|topic) and theta P(topic|document) are the same across textmineR and LDAvis, so I'd expect similar topic/word clusters.
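
As a quick sanity check on those definitions, the sketch below uses small made-up matrices in place of model$phi and model$theta (the values, the term names, and the helper rows_are_distributions are hypothetical). It checks that each row of phi and each row of theta is a probability distribution, and that the topic names line up between the two matrices, which is the layout both packages appear to expect.

```r
# Toy matrices standing in for model$phi (topics x tokens) and
# model$theta (documents x topics). All values are invented for illustration.
phi <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.1, 0.8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t_1", "t_2"), c("gene", "cell", "cancer")))

theta <- matrix(c(0.9, 0.1,
                  0.4, 0.6,
                  0.2, 0.8),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("d_", 1:3), c("t_1", "t_2")))

# Hypothetical helper: TRUE if every row is non-negative and sums to 1
rows_are_distributions <- function(m, tol = 1e-8) {
  all(m >= 0) && all(abs(rowSums(m) - 1) < tol)
}

phi_ok <- rows_are_distributions(phi)      # each row is P(token | topic)
theta_ok <- rows_are_distributions(theta)  # each row is P(topic | document)

# Topic names should match between the rows of phi and the columns of theta
topics_aligned <- identical(rownames(phi), colnames(theta))
```

If all three checks pass on the real model, the inputs themselves are consistent, and any mismatch is more likely to come from how topics are numbered downstream.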


# load the packages and the nih_sample data set from textmineR
library(textmineR)
library(LDAvis)
data(nih_sample)

# create a document term matrix 
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, # character vector of documents
                 doc_names = nih_sample$APPLICATION_ID, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # English stopwords from the stopwords package
                                  stopwords::stopwords(source = "smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # Turn off status bar for this demo
                 cpus = 2) # default is all available cpus on the system

dtm <- dtm[,colSums(dtm) > 2]


model <- FitLdaModel(dtm = dtm, 
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations or more
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 2) 

model$top_terms <- GetTopTerms(phi = model$phi, M = 10)

# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents. 
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100

# textmineR has a naive topic labeling tool based on probable bigrams
model$labels <- LabelTopics(assignments = model$theta > 0.05, 
                            dtm = dtm,
                            M = 1)

model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence, 3),
                            prevalence = round(model$prevalence,3),
                            top_terms = apply(model$top_terms, 2, function(x){
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)
model$summary[ order(model$summary$prevalence, decreasing = TRUE) , ][ 1:10 , ]

# summary of document lengths
doc_lengths <- rowSums(dtm)
# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)

# create the JSON object to feed the visualization:
json <- createJSON(
  phi = model$phi,
  theta = model$theta,
  doc.length = doc_lengths,
  vocab = tf_mat$term,
  term.frequency = tf_mat$term_freq
)

serVis(json, open.browser = TRUE)

TommyJones commented 5 years ago

Hi. First, sorry it took me so long to get to this. Second, I don't know that there's anything I can do for you here, at least not without a reproducible example. If you can get me one, I'd happily take a look and see if it's an issue with textmineR, LDAvis, or something else. Maybe use the nih_sample_topic_model that ships with textmineR?

Anyway, closing for now. Feel free to re-open if you have that reproducible example I can work off of.

TommyJones commented 5 years ago

Sorry, I was being dumb. You did give me an example using the nih_sample data.

I'm not sure how LDAvis sorts terms. Maybe take this issue up there?

leungi commented 5 years ago

Glad you're able to respond @TommyJones; thanks!

I'll try to cross-post this with LDAvis.

TommyJones commented 5 years ago

So after playing with this for a couple of minutes, it looks like their index is off. From the example I have, LDAvis topic 1 is referencing t_12. LDAvis topic 2 references t_19. I haven't checked the others, but that's odd behavior.
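
One likely explanation for the shifted index: by default, createJSON appears to renumber topics in decreasing order of their token-weighted share of the corpus, so "topic 1" in the plot is the most prevalent topic rather than t_1. The sketch below is an assumption about that behavior, not a copy of the LDAvis source. It uses made-up values in place of model$theta and rowSums(dtm), and the names topic_order and topic_map are hypothetical.

```r
# Toy inputs standing in for model$theta and doc_lengths <- rowSums(dtm).
# All values are invented for illustration.
theta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.2, 0.6,
    0.1, 0.1, 0.8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("d_", 1:4), paste0("t_", 1:3))
)
doc_lengths <- c(100, 50, 80, 120)

# Weight each document's topic proportions by its length, then sum per topic
topic_frequency <- colSums(theta * doc_lengths)
topic_proportion <- topic_frequency / sum(topic_frequency)

# Assumed LDAvis numbering: most prevalent topic first
topic_order <- order(topic_proportion, decreasing = TRUE)

# Lookup table from the LDAvis topic number to the original topic name
topic_map <- data.frame(
  ldavis_topic = seq_along(topic_order),
  textminer_topic = colnames(theta)[topic_order],
  proportion = round(topic_proportion[topic_order], 3),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(topic_map)
```

In this toy case LDAvis topic 1 corresponds to t_3. The weighting here differs from model$prevalence in the original post, which sums theta without document lengths, so the two rankings need not agree exactly. If the original numbering is preferred in the plot, passing reorder.topics = FALSE to createJSON should keep the topics in the order given by model$phi.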

leungi commented 5 years ago

That seems to be the case from my memory.

Thanks for investigating!