Open panoptikum opened 5 years ago
I've found this on stackoverflow:
I guess that still is the last status?
How would look such a calculation that you've mentioned on stackoverflow, @bstewart ? I think my programming skills could be good enough.
Well, I've digged through the estimateEffect() and as far as I understand the code without uncertainty I can sum the thetasims of the topics I would like to group? Please correct me, if I'm mistaken. In that case, I would just modify the estimateEffect() or clone it and use this modified version for my needs.
Though this seems too easy to be true?
And what about the uncertain case? How would it differ?
You are totally correct- without uncertainty you can just add them. With the uncertainty you can simulate draws from the posterior using thetaPosterior() and just add them within each draw!
Ok, sounds very doable. If you like, I can share my modified estimateEffect here and may open a PR?
That sounds great. I've worked on some code to do this and ultimately the challenge has been in defining a syntax for people to specify what they want. Any suggestions would be much appreciated!
True, the syntax for specification will be the most challenging part. I'll think about it and make a proposal soon.
My proposal is a follows:
group
to the function estimateEffect. The default is FALSE.'c(6)~...'
or 'c(6, 12)~...'
for topic 6 or topic 6 and 12. I propose 'c(6, 12) & c(8, 14)~ ...'
to indicate to summarise the topics 6 and 12 / 8 and 14 as one topic.I've already written code that includes this idea in estimateEffect and only uses base R. If you like my proposed syntax, I can open a PR.
I also want to merge topics in stm model, but my programming skills and statistical knowledge are not enough to do it by myself. Do you mind if I ask your solution to merge topics? @panoptikum If it is hard to show your own code of estimateEffect which you corrected, can I know briefly about the procedure to get estimated effect of stm model with merged topics?
Hello, sure. You find below my modified functions and some internal functions of STM that you have to load:
estimateEffectDEV <- function (formula, stmobj, metadata = NULL, uncertainty = c("Global",
"Local", "None"), group=FALSE, documents = NULL, nsims = 25, prior = NULL)
{
origcall <- match.call()
thetatype <- match.arg(uncertainty)
if (thetatype == "None")
nsims <- 1
if (!is.null(documents)) {
args <- asSTMCorpus(documents, data = metadata)
documents <- args$documents
metadata <- args$data
}
if (!inherits(formula, "formula"))
stop("formula must be a formula object.")
if (!is.null(metadata) & !is.data.frame(metadata))
metadata <- as.data.frame(metadata)
termobj <- terms(formula, data = metadata)
if (attr(termobj, "response") == 1) {
response <- as.character(formula)[2]
if (group==TRUE) {
grouped_topics <- trimws(strsplit(response, "&")[[1]])
K <- 1:length(grouped_topics)
} else {
K <- eval(parse(text = response))
if (!(posint(K) && max(K) <= stmobj$settings$dim$K))
stop("Topics specified as response in formula must be a set of positive integers equal to or less than the number of topics in the model.")
}
formula <- as.formula(as.character(formula)[c(1, 3)])
termobj <- terms(formula, data = metadata)
} else {
K <- 1:stmobj$settings$dim$K
}
mf <- model.frame(termobj, data = metadata)
xmat <- model.matrix(termobj, data = metadata)
varlist <- all.vars(termobj)
if (!is.null(metadata)) {
data <- metadata[, varlist, drop = FALSE]
}
else {
templist <- list()
for (i in 1:length(varlist)) {
templist[[i]] <- get(varlist[i])
}
data <- data.frame(templist)
names(data) <- varlist
rm(templist)
}
metadata <- data
rm(data)
if (!is.null(prior)) {
if (!is.matrix(prior)) {
prior <- diag(prior, nrow = ncol(xmat))
}
if (ncol(prior) != ncol(xmat))
stop("number of columns in prior does not match columns in design matrix")
prior.pseudo <- chol(prior)
xmat <- rbind(xmat, prior.pseudo)
}
qx <- qr(xmat)
if (qx$rank < ncol(xmat)) {
prior <- diag(1e-05, nrow = ncol(xmat))
prior.pseudo <- chol(prior)
xmat <- rbind(xmat, prior.pseudo)
qx <- qr(xmat)
warning("Covariate matrix is singular. See the details of ?estimateEffect() for some common causes.\n Adding a small prior 1e-5 for numerical stability.")
}
storage <- vector(mode = "list", length = length(K))
K_grouped <- c()
for (i in 1:nsims) {
if (thetatype == "None")
thetasims <- stmobj$theta
else {
thetasims <- thetaPosterior(stmobj, nsims = 1, type = thetatype,
documents = documents)
thetasims <- do.call(rbind, thetasims)
}
for (k in K) {
if (group==TRUE) {
topics_to_group <- eval(parse(text = grouped_topics[k]))
grouped_thetasims <- apply(thetasims[, topics_to_group], 1, sum)
lm.mod <- qr.lm(grouped_thetasims, qx)
} else {
lm.mod <- qr.lm(thetasims[, k], qx)
}
storage[[which(k == K)]][[i]] <- summary.qr.lm(lm.mod)
if (group==TRUE & i==1)
K_grouped[k] <- paste0(c(k, ": ", paste0(topics_to_group, collapse = "+")), collapse = "")
}
}
toreturn <- list(parameters = storage, topics = K_grouped, call = origcall,
uncertainty = thetatype, formula = formula, data = metadata,
modelframe = mf, varlist = varlist)
class(toreturn) <- "estimateEffect"
return(toreturn)
}
qr.lm <- function(y, qx) {
if(length(y)!=nrow(qx$qr)) {
#probably don't match because of a prior
if(length(y)!=(nrow(qx$qr)-ncol(qx$qr))) stop("number of covariate observations does not match number of docs")
#if it passes this check its the prior. thus
y <- c(y,rep(0, ncol(qx$qr)))
}
beta <- solve.qr(qx, y)
residuals <- qr.resid(qx,y)
fitted.values <- qr.fitted(qx,y)
df.residual <- length(fitted.values) - qx$rank
out <- list(coefficients=beta, residuals=residuals,
fitted.values=fitted.values,
df.residual=df.residual, rank=qx$rank, qr=qx)
out
}
#this function rewrites the summary.lm() function
# to calculate from our reduced regression
summary.qr.lm <- function (object) {
z <- object
p <- z$rank
rdf <- z$df.residual
Qr <- object$qr
n <- nrow(Qr$qr)
p1 <- 1L:p
r <- z$residuals
f <- z$fitted.values
mss <- ifelse(attr(z$terms, "intercept"), sum((f - mean(f))^2), sum(f^2))
rss <- sum(r^2)
resvar <- rss/rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- z$coefficients[Qr$pivot[p1]]
sigma <- sqrt(resvar)
list(est=est, vcov=(sigma^2 * R))
}
print.summary.estimateEffect <- function(x, digits = max(3L, getOption("digits") - 3L),
signif.stars = getOption("show.signif.stars"), ...) {
cat("\nCall:\n", paste(deparse(x$call), sep = "\n", collapse = "\n"),
"\n\n", sep = "")
for(i in 1:length(x$tables)) {
cat(sprintf("\nTopic %s:\n", x$topics[i]))
cat("\nCoefficients:\n")
coefs <- x$tables[[i]]
stats::printCoefmat(coefs, digits = digits, signif.stars = signif.stars,
na.print = "NA", ...)
cat("\n")
}
invisible(x)
}
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) {
abs(x - round(x)) < tol
}
posint <- function(x) {
all(is.wholenumber(x)) & all(x>0)
}
nonnegint <- function(x) {
all(is.wholenumber(x)) & all(x>=0)
}
extract.estimateEffect <- function (x, covariate, model = NULL, topics = x$topics, method = "pointestimate",
cov.value1 = NULL, cov.value2 = NULL, moderator = NULL, moderator.value = NULL,
npoints = 100, nsims = 100, ci.level = 0.95, custom.labels = NULL,
labeltype = "numbers", n = 7, frexw = 0.5)
{
cthis <- stm:::produce_cmatrix(prep = x, covariate = covariate,
method = method, cov.value1 = cov.value1, cov.value2 = cov.value2,
moderator = moderator, moderator.value = moderator.value)
simbetas <- stm:::simBetas(parameters = x$parameters, nsims = nsims)
uvals <- cthis$cdata[[covariate]]
offset <- (1 - ci.level)/2
labels <- stm:::createLabels(labeltype = labeltype, covariate = covariate,
method = method, cdata = cthis$cdata, cov.value1 = cov.value1,
cov.value2 = cov.value2, model = model, n = n, topics = x$topics,
custom.labels = custom.labels, frexw = frexw)
out <- lapply(topics, function(i) {
sims <- cthis$cmatrix %*% t(simbetas[[which(x$topics ==
i)]])
if (method == "difference") {
diff <- sims[1, ] - sims[2, ]
out_inner <- data.frame(method = method, topic = i,
covariate = covariate, covariate.value = paste0(cov.value1,
"-", cov.value2), estimate = mean(diff), std.error = sd(diff),
ci.level = ci.level, ci.lower = quantile(diff,
offset), ci.upper = quantile(diff, 1 - offset),
label = labels[which(x$topics == i)])
}
else {
#browser()
out_inner <- data.frame(method = method, topic = i,
covariate = covariate, covariate.value = uvals,
estimate = rowMeans(sims), std.error = sqrt(apply(sims,
1, stats::var)), ci.level = ci.level, ci.lower = apply(sims,
1, quantile, probs = offset), ci.upper = apply(sims,
1, quantile, probs = (1 - offset)), label = labels[which(x$topics ==
i)])
out_inner["tval"] <- out_inner["estimate"]/out_inner["std.error"]
rdf <- nrow(x$data) - length(out_inner["estimate"])
out_inner["p"] <- 2 * stats::pt(abs(out_inner[["tval"]]), rdf, lower.tail = FALSE)
}
if (!is.null(moderator)) {
out_inner$moderator <- moderator
out_inner$moderator.value <- moderator.value
}
rownames(out_inner) <- NULL
return(out_inner)
})
out <- do.call("rbind", out)
return(out)
}
Thank you, so much! :) @panoptikum
Thanks for sharing your code, @panopticon. I also hope to estimate the effects of topic groups, so this is very helpful. Using your modified functions, I was able to estimate the effects of a topic group. However, using the modified summary function does not work for me, so I cannot look at the effects for topic groups. Is this a bug or am I not using your modified functions as intended? In case you or @jehoonchae are still following this thread, I would be grateful for any hint.
Reprex:
library(stm)
model <- stm(documents = poliblog5k.docs, vocab = poliblog5k.voc, K = 3)
#After creating modified functions with panopticon’s code from above:
test <- estimateEffectDEV(c(1,2) ~ rating + day, stmobj = model, metadata = poliblog5k.meta, groups = TRUE)
print.summary.estimateEffect(test) # This throws an error
summary(test) # This throws an error as well
The errors that I get are:
Error in stats::printCoefmat(coefs, digits = digits, signif.stars = signif.stars, : 'x' must be coefficient matrix/data frame
Error in base::colMeans(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions
Since I received several questions related to using the @panoptikum's customized function estimateEffectDEV()
and drawing a plot on time, I wrote a brief explication on the issue. Please check the files named stm-merge.*
in the repo. I hope it could be the answer to your question.
Thanks @jehoonchae, your work is helpful. However, my main interest is to obtain a regression table just as summary.estimateEffect produces it. As I wrote in my previous post, I think that @panopticon's modified version of print.summary.estimateEffect() should do this job, but it throws an error.
Sorry, I misunderstood your question.
Please use the following code. I added marginally modified summary.estimateEffect()
function.
estimateEffectDEV <- function (formula, stmobj, metadata = NULL, uncertainty = c("Global",
"Local", "None"), group=FALSE, documents = NULL, nsims = 25, prior = NULL)
{
origcall <- match.call()
thetatype <- match.arg(uncertainty)
if (thetatype == "None")
nsims <- 1
if (!is.null(documents)) {
args <- asSTMCorpus(documents, data = metadata)
documents <- args$documents
metadata <- args$data
}
if (!inherits(formula, "formula"))
stop("formula must be a formula object.")
if (!is.null(metadata) & !is.data.frame(metadata))
metadata <- as.data.frame(metadata)
termobj <- terms(formula, data = metadata)
if (attr(termobj, "response") == 1) {
response <- as.character(formula)[2]
if (group==TRUE) {
grouped_topics <- trimws(strsplit(response, "&")[[1]])
K <- 1:length(grouped_topics)
} else {
K <- eval(parse(text = response))
if (!(posint(K) && max(K) <= stmobj$settings$dim$K))
stop("Topics specified as response in formula must be a set of positive integers equal to or less than the number of topics in the model.")
}
formula <- as.formula(as.character(formula)[c(1, 3)])
termobj <- terms(formula, data = metadata)
} else {
K <- 1:stmobj$settings$dim$K
}
mf <- model.frame(termobj, data = metadata)
xmat <- model.matrix(termobj, data = metadata)
varlist <- all.vars(termobj)
if (!is.null(metadata)) {
data <- metadata[, varlist, drop = FALSE]
}
else {
templist <- list()
for (i in 1:length(varlist)) {
templist[[i]] <- get(varlist[i])
}
data <- data.frame(templist)
names(data) <- varlist
rm(templist)
}
metadata <- data
rm(data)
if (!is.null(prior)) {
if (!is.matrix(prior)) {
prior <- diag(prior, nrow = ncol(xmat))
}
if (ncol(prior) != ncol(xmat))
stop("number of columns in prior does not match columns in design matrix")
prior.pseudo <- chol(prior)
xmat <- rbind(xmat, prior.pseudo)
}
qx <- qr(xmat)
if (qx$rank < ncol(xmat)) {
prior <- diag(1e-05, nrow = ncol(xmat))
prior.pseudo <- chol(prior)
xmat <- rbind(xmat, prior.pseudo)
qx <- qr(xmat)
warning("Covariate matrix is singular. See the details of ?estimateEffect() for some common causes.\n Adding a small prior 1e-5 for numerical stability.")
}
storage <- vector(mode = "list", length = length(K))
K_grouped <- c()
for (i in 1:nsims) {
if (thetatype == "None")
thetasims <- stmobj$theta
else {
thetasims <- thetaPosterior(stmobj, nsims = 1, type = thetatype,
documents = documents)
thetasims <- do.call(rbind, thetasims)
}
for (k in K) {
if (group==TRUE) {
topics_to_group <- eval(parse(text = grouped_topics[k]))
grouped_thetasims <- apply(thetasims[, topics_to_group], 1, sum)
lm.mod <- qr.lm(grouped_thetasims, qx)
} else {
lm.mod <- qr.lm(thetasims[, k], qx)
}
storage[[which(k == K)]][[i]] <- summary.qr.lm(lm.mod)
if (group==TRUE & i==1)
K_grouped[k] <- paste0(c(k, ": ", paste0(topics_to_group, collapse = "+")), collapse = "")
}
}
toreturn <- list(parameters = storage, topics = K_grouped, call = origcall,
uncertainty = thetatype, formula = formula, data = metadata,
modelframe = mf, varlist = varlist)
class(toreturn) <- "estimateEffect"
return(toreturn)
}
qr.lm <- function(y, qx) {
if(length(y)!=nrow(qx$qr)) {
#probably don't match because of a prior
if(length(y)!=(nrow(qx$qr)-ncol(qx$qr))) stop("number of covariate observations does not match number of docs")
#if it passes this check its the prior. thus
y <- c(y,rep(0, ncol(qx$qr)))
}
beta <- solve.qr(qx, y)
residuals <- qr.resid(qx,y)
fitted.values <- qr.fitted(qx,y)
df.residual <- length(fitted.values) - qx$rank
out <- list(coefficients=beta, residuals=residuals,
fitted.values=fitted.values,
df.residual=df.residual, rank=qx$rank, qr=qx)
out
}
#this function rewrites the summary.lm() function
# to calculate from our reduced regression
summary.qr.lm <- function (object) {
z <- object
p <- z$rank
rdf <- z$df.residual
Qr <- object$qr
n <- nrow(Qr$qr)
p1 <- 1L:p
r <- z$residuals
f <- z$fitted.values
mss <- ifelse(attr(z$terms, "intercept"), sum((f - mean(f))^2), sum(f^2))
rss <- sum(r^2)
resvar <- rss/rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- z$coefficients[Qr$pivot[p1]]
sigma <- sqrt(resvar)
list(est=est, vcov=(sigma^2 * R))
}
rmvnorm<-function(n,mu,Sigma,chol.Sigma=chol(Sigma)) {
E<-matrix(rnorm(n*length(mu)),n,length(mu))
t( t(E%*%chol.Sigma) +c(mu))
}
summary.estimateEffect <- function(object, topics=NULL, nsim=500, ...) {
if(is.null(topics)) topics <- object$topics
if(any(!(topics %in% object$topics))) {
stop("Some topics specified with the topics argument are not available in this estimateEffect object.")
}
tables <- vector(mode="list", length=length(topics))
for(i in seq_along(topics)) {
topic = as.numeric(strsplit(topics[i], ":")[[1]][1])
sims <- lapply(object$parameters[[topic]], function(x) rmvnorm(nsim, x$est, x$vcov))
sims <- do.call(rbind,sims)
est<- colMeans(sims)
se <- sqrt(apply(sims,2, stats::var))
tval <- est/se
rdf <- nrow(object$data) - length(est)
p <- 2 * stats::pt(abs(tval), rdf, lower.tail = FALSE)
coefficients <- cbind(est, se, tval, p)
rownames(coefficients) <- attr(object$parameters[[1]][[1]]$est, "names")
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
tables[[i]] <- coefficients
}
out <- list(call=object$call, topics=topics, tables=tables)
class(out) <- "summary.estimateEffect"
return(out)
}
print.summary.estimateEffect <- function(x, digits = max(3L, getOption("digits") - 3L),
signif.stars = getOption("show.signif.stars"), ...) {
cat("\nCall:\n", paste(deparse(x$call), sep = "\n", collapse = "\n"),
"\n\n", sep = "")
for(i in 1:length(x$tables)) {
cat(sprintf("\nTopic %s:\n", x$topics[i]))
cat("\nCoefficients:\n")
coefs <- x$tables[[i]]
stats::printCoefmat(coefs, digits = digits, signif.stars = signif.stars,
na.print = "NA", ...)
cat("\n")
}
invisible(x)
}
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) {
abs(x - round(x)) < tol
}
posint <- function(x) {
all(is.wholenumber(x)) & all(x>0)
}
nonnegint <- function(x) {
all(is.wholenumber(x)) & all(x>=0)
}
extract.estimateEffect <- function (x, covariate, model = NULL, topics = x$topics, method = "pointestimate",
cov.value1 = NULL, cov.value2 = NULL, moderator = NULL, moderator.value = NULL,
npoints = 100, nsims = 100, ci.level = 0.95, custom.labels = NULL,
labeltype = "numbers", n = 7, frexw = 0.5)
{
cthis <- stm:::produce_cmatrix(prep = x, covariate = covariate,
method = method, cov.value1 = cov.value1, cov.value2 = cov.value2,
moderator = moderator, moderator.value = moderator.value)
simbetas <- stm:::simBetas(parameters = x$parameters, nsims = nsims)
uvals <- cthis$cdata[[covariate]]
offset <- (1 - ci.level)/2
labels <- stm:::createLabels(labeltype = labeltype, covariate = covariate,
method = method, cdata = cthis$cdata, cov.value1 = cov.value1,
cov.value2 = cov.value2, model = model, n = n, topics = x$topics,
custom.labels = custom.labels, frexw = frexw)
out <- lapply(topics, function(i) {
sims <- cthis$cmatrix %*% t(simbetas[[which(x$topics ==
i)]])
if (method == "difference") {
diff <- sims[1, ] - sims[2, ]
out_inner <- data.frame(method = method, topic = i,
covariate = covariate, covariate.value = paste0(cov.value1,
"-", cov.value2), estimate = mean(diff), std.error = sd(diff),
ci.level = ci.level, ci.lower = quantile(diff,
offset), ci.upper = quantile(diff, 1 - offset),
label = labels[which(x$topics == i)])
}
else {
#browser()
out_inner <- data.frame(method = method, topic = i,
covariate = covariate, covariate.value = uvals,
estimate = rowMeans(sims), std.error = sqrt(apply(sims,
1, stats::var)), ci.level = ci.level, ci.lower = apply(sims,
1, quantile, probs = offset), ci.upper = apply(sims,
1, quantile, probs = (1 - offset)), label = labels[which(x$topics ==
i)])
out_inner["tval"] <- out_inner["estimate"]/out_inner["std.error"]
rdf <- nrow(x$data) - length(out_inner["estimate"])
out_inner["p"] <- 2 * stats::pt(abs(out_inner[["tval"]]), rdf, lower.tail = FALSE)
}
if (!is.null(moderator)) {
out_inner$moderator <- moderator
out_inner$moderator.value <- moderator.value
}
rownames(out_inner) <- NULL
return(out_inner)
})
out <- do.call("rbind", out)
return(out)
}
Then try what you did again with summary()
like the following.
library(stm)
source('estimateEffectDEV.R')
model <- stm(documents = poliblog5k.docs, vocab = poliblog5k.voc, K = 3)
test <- estimateEffectDEV(c(1,2) ~ rating + s(day),
stmobj = model,
metadata = poliblog5k.meta,
group = TRUE)
summary(test)
Call:
estimateEffectDEV(formula = c(1, 2) ~ rating + s(day), stmobj = model,
metadata = poliblog5k.meta, group = TRUE)
Topic 1: 1+2:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.584850 0.041278 14.169 <2e-16 ***
ratingLiberal 0.088502 0.009519 9.298 <2e-16 ***
s(day)1 0.076430 0.081379 0.939 0.3477
s(day)2 -0.094751 0.047893 -1.978 0.0479 *
s(day)3 -0.069118 0.059332 -1.165 0.2441
s(day)4 -0.046397 0.048474 -0.957 0.3385
s(day)5 -0.056602 0.053356 -1.061 0.2888
s(day)6 0.046230 0.049852 0.927 0.3538
s(day)7 0.145387 0.052234 2.783 0.0054 **
s(day)8 0.072769 0.060424 1.204 0.2285
s(day)9 0.028451 0.062221 0.457 0.6475
s(day)10 -0.092038 0.059254 -1.553 0.1204
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Glad to read that you've figured it out. Sorry for any errors, but I do not have the time to cross validate at the moment.
Thanks so much to @panoptikum and @jehoonchae for this code. It's really helped me out. And to @bstewart for the stm package too!
Thank you very much @panoptikum, @jehoonchae & of course @bstewart. You answered the question that kept me busy for weeks now. I just had a short question regarding the uncertainty issue raised above. Are the different uncertainty measures in estimateEffect
applicable in your amendment, @panoptikum & @jehoonchae?
My data was ran through without problems using the 'Global' option and your example - @jehoonchae - would also use this, if I'm correct? But I'm unsure if I should proceed with this or if it is better for me to run the regressions without this. Thank you so much!
Thanks so much, @panoptikum , @jehoonchae , and @bstewart for the modified dev version of estimateEffect(). This is super helpful.
I was wondering how to modify and make use of the estimateeffect() code when the dependent variable is a proportion of a combination of multiple topics (e.g., (topic1 + topic3)/(topic2 + topic4 + topic1 + topic3)). Essentially, I was keen to understand how a metadata relates to relative topic emphasis.
Request your help.
Dear STM community,
I'm wondering whether it is possible to "merge" two topics into one as they correlate a lot or because I think that the topics align much. I would like to estimate an effect for the resulting topic, but if I am not mistaken when I specify two topics for estimateEffect, then the effect is computed separately? I have not fully understood the mechanisms behind STM, i.e. what all the data that is returned means, so maybe I can just add something up and feed this to the estimateEffect function?
Many thanks for any help.
panoptikum