bstewart / stm

An R Package for the Structural Topic Model

Question on different calculations of word labels and probabilities #170

Open NilsDroste opened 6 years ago

NilsDroste commented 6 years ago

Hi Brandon,

I am estimating an stm model with prevalence covariates and a 2-level factor content covariate. I am struggling to understand the exact difference between the following ways of obtaining characteristic words for a topic and hope a set of related questions is in order:

a) stm::plot.STM(..., type = "labels"): Labels
b) stm::labelTopics(): Topic Words

As far as I can see:

a) calculates topic labels as lab[topic,] with:

    weights <- model$settings$covariates$betaindex
    tab <- table(weights)
    weights <- tab/sum(tab)
    beta <- exp(model$beta$logbeta[[1]]) * weights[1]
    for (i in 2:length(model$beta$logbeta)) {
      beta <- beta + exp(model$beta$logbeta[[i]]) * weights[i]
    }
    lab <- t(apply(beta, 1, function(x) model$vocab[order(x, decreasing = TRUE)[1:n]]))

in my case this gives [1] "sustain" "develop" "approach" "need" ...

Does this simply calculate the most probable words per topic?
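To make the question concrete, here is a minimal sketch of the computation in a) on made-up data, with no fitted model required. The objects `vocab`, `logbeta` and `betaindex` are toy stand-ins (assumptions for illustration) for `model$vocab`, `model$beta$logbeta` and `model$settings$covariates$betaindex`. Read this way, the code averages the level-specific word distributions, weighted by the share of documents in each content covariate level, and then ranks words by that averaged probability.

```r
# Toy stand-in for a fitted model: 2 topics, 4 vocabulary terms, and a
# 2-level content covariate. All values are made up for illustration.
vocab <- c("sustain", "develop", "approach", "need")

# One K x V matrix of log word probabilities per covariate level
# (the role played by model$beta$logbeta).
logbeta <- list(
  log(rbind(c(0.4, 0.3, 0.2, 0.1),
            c(0.1, 0.2, 0.3, 0.4))),
  log(rbind(c(0.1, 0.5, 0.3, 0.1),
            c(0.3, 0.3, 0.3, 0.1)))
)

# Covariate level of each document (the role of
# model$settings$covariates$betaindex): 3 documents in level 1, 1 in level 2.
betaindex <- c(1, 1, 1, 2)

# Share of documents in each covariate level.
weights <- as.numeric(table(betaindex) / length(betaindex))

# Weighted average of the level-specific word distributions.
beta <- Reduce(`+`, Map(function(lb, w) exp(lb) * w, logbeta, weights))

# Top n words per topic, ranked by the averaged probability.
n <- 3
lab <- t(apply(beta, 1, function(x) vocab[order(x, decreasing = TRUE)[1:n]]))
print(lab)
```

Each row of `beta` still sums to 1, so in this toy setting the result is a proper word distribution per topic with the content covariate averaged out.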

b) calculates Topic Words as out$topics with:

    labs <- lapply(model$beta$kappa$params, function(x) {
      windex <- order(x, decreasing = TRUE)[1:n]
      ifelse(x[windex] > 0.001, vocab[windex], "")
    })
    A <- model$settings$dim$A
    anames <- model$settings$covariates$yvarlevels
    i1 <- K + 1
    i2 <- K + A
    intnums <- (i2 + 1):nrow(labs)
    labs <- do.call(rbind, labs)
    out$topics <- labs[topics, , drop = FALSE]

in my case this one gives: [1] "aptitud" "institutionalis" "stakehold" "sdgs" ...

Does this calculate the probable words per topic based on the log-transformed rate deviations from the corpus-wide background distribution over words, given content covariate A? In other words, are these the most probable words given the influence of content covariates on the word probability distribution?
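A toy sketch may help separate the two rankings. It assumes the SAGE-style parameterization described in the stm documentation, in which a word's log probability is proportional to a corpus-wide baseline plus topic, covariate and interaction deviations. The names `m`, `kappa_topic`, `kappa_cov` and `kappa_int` and all numbers are made up for illustration and are not read from a fitted model.

```r
# Made-up toy values: 5 vocabulary terms, one topic, one covariate level.
vocab <- c("develop", "sustain", "approach", "sdgs", "aptitud")

# Baseline log word frequencies across the corpus:
# the first three words are common, the last two are rare.
m <- log(c(0.40, 0.30, 0.25, 0.04, 0.01))

# Topic-specific deviations from the baseline (the role of one topic
# entry in model$beta$kappa$params).
kappa_topic <- c(0.1, 0.2, 0.0, 1.5, 2.0)

# Covariate-level and topic-by-covariate deviations, set to zero here
# to keep the example small.
kappa_cov <- rep(0, 5)
kappa_int <- rep(0, 5)

# Word distribution for this topic and level: softmax of the summed terms.
eta <- m + kappa_topic + kappa_cov + kappa_int
beta <- exp(eta) / sum(exp(eta))

# Two different rankings of the same vocabulary.
n <- 2
top_by_kappa <- vocab[order(kappa_topic, decreasing = TRUE)[1:n]]
top_by_prob <- vocab[order(beta, decreasing = TRUE)[1:n]]
print(top_by_kappa)
print(top_by_prob)
```

Under these assumptions, ranking by the deviation favours words that are unusually frequent in the topic relative to the baseline, even if they are rare overall, while ranking by probability favours words that are simply frequent. That would be consistent with b) returning much more specific terms than a).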

So, I am wondering why:

  1. stm::plot.STM(..., type = "labels") does not give the same labels as stm::plot.STM(..., type = "summary"), while the latter does give the same labels as the Topic Words from stm::labelTopics(). Also, stm::plot.STM(..., type = "perspectives") seems to display the same word probability distribution as stm::plot.STM(..., type = "labels"), as does stm::cloud(). Shouldn't they all be the same? If not, could the most probable words without content covariate influence be included in stm::labelTopics()? I am not sure what makes the most sense here.

  2. Relatedly, do you think it is okay to inspect a topic from a content covariate model by its most probable words without covariate influences on the word probability distribution (e.g. as calculated in a)? If yes, is that calculation correct, or would it rather have to be calculated (if it were accessible) as in out$prob of stm::labelTopics(), which in my case also deviates slightly from the most probable words calculated as in stm::plot.STM(..., type = "labels")?

  3. And a bonus one: in Roberts et al. (2016), "A model of text for experimentation in the social sciences", you calculated the most probable words per prevalence covariate (news source). How did you obtain these?

Thanks,
N

NilsDroste commented 6 years ago

I just found stm::sageLabels(), which helps a lot in understanding this. Impressive. This is what I was looking for. The Kappa output (the most probable words given content covariate influence?) is more or less clear. But what is Kappa against baseline, i.e. why is it different from the marginal word probabilities?

meier-flo commented 1 year ago

Hi @NilsDroste

I know that it has been a while, but maybe you can share your experience or solutions on how you answered the questions yourself. When training an stm with covariates I ran into similar questions, so I was wondering:

  1. The Topic Kappa words are in many cases very specific, which makes the topics hard to interpret and label. I would also prefer to use and report Marginal Highest Prob and/or Marginal FREX words, but as I understand it the covariate influence is averaged out for those words?
  2. The main reason for using the covariates is that I want to run estimateEffect(), which only really works if the model was built with the same covariates as will be used in the regression models?
  3. Is there any literature that explains the kappa values in greater detail?

Any help would be highly appreciated, Florian