brandmaier / semtree

Recursive Partitioning for Structural Equation Models
https://brandmaier.github.io/semtree/
GNU General Public License v3.0
13 stars 11 forks source link

Semtree fails when using tibble instead of data.frame #68

Open jhorzek opened 3 months ago

jhorzek commented 3 months ago

Hi Andreas,

I just tried using semtree with a data set that I had imported into R using haven. The resulting data.frame is a tibble data.frame. When running semtree based on a lavaan model, I was surprised that there were no splits for my data set. Only when casting the tibble to a base R data.frame, the model was partitioned correctly. When using OpenMx models, I found that tibbles result in errors. Here is an example:

library(semtree)
library(tibble)
library(OpenMx)
library(lavaan)

# For lavaan, semtree does not throw any warnings / errors when using tibble,
# but also does not run correctly:

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

HolzingerSwineford1939 <- HolzingerSwineford1939[complete.cases(HolzingerSwineford1939), ]

fit <- cfa(HS.model, data = tibble::as_tibble(HolzingerSwineford1939))
# Using tibble does not work:
tree <- semtree(model = fit, 
                data  = tibble::as_tibble(HolzingerSwineford1939))
print(tree)

# With data.frames, the tree is built correctly
tree <- semtree(model = fit, 
                data  = HolzingerSwineford1939,
                control = semtree.control(method = "score"))
print(tree)

## For OpenMx, I get an error, but one that reports issues with P1 not being
## a factor:

set.seed(23)
N <- 1000
M <- 5
icept <- rnorm(N, 10, sd = 4)
slope <- rnorm(N, 3, sd = 1.2)
p1 <- sample(c(0, 1), size = N, replace = TRUE)
loadings <- 0:4
x <- (slope + p1 * 5) %*% t(loadings) + 
  matrix(rep(icept, each = M), byrow = TRUE, ncol = M) + 
  rnorm(N * M, sd = .08)
colnames(x) <- paste0("X", 1:M)

# Use tibble
growth.data    <- tibble::as_tibble(x)
growth.data$P1 <- as.factor(p1)

manifests <- names(growth.data)[1:5]
growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification",
                            type="RAM",
                            manifestVars=manifests,
                            latentVars=c("intercept","slope"),
                            mxData(growth.data, type="raw"),
                            # residual variances
                            mxPath(
                              from=manifests,
                              arrows=2,
                              free=TRUE,
                              values = c(.1, .1, .1, .1, .1),
                              labels=c("residual","residual","residual","residual","residual")
                            ),
                            # latent variances and covariance
                            mxPath(
                              from=c("intercept","slope"),
                              arrows=2,
                              connect="unique.pairs",
                              free=TRUE,
                              values=c(2, 0, 1),
                              labels=c("vari", "cov", "vars")
                            ),
                            # intercept loadings
                            mxPath(
                              from="intercept",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(1, 1, 1, 1, 1)
                            ),
                            # slope loadings
                            mxPath(
                              from="slope",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(0, 1, 2, 3, 4)
                            ),
                            # manifest means
                            mxPath(
                              from="one",
                              to=manifests,
                              arrows=1,
                              free=FALSE,
                              values=c(0, 0, 0, 0, 0)
                            ),
                            # latent means
                            mxPath(
                              from="one",
                              to=c("intercept", "slope"),
                              arrows=1,
                              free=TRUE,
                              values=c(1, 1),
                              labels=c("meani", "means")
                            )
) # close model

# fit the model to the entire dataset
growthCurveModel <- mxRun(growthCurveModel)

# Using tibble does not work:
tree <- semtree(model = growthCurveModel, 
                data  = growth.data)
# Using data.frame does work:
tree <- semtree(model = growthCurveModel, 
                data  = as.data.frame(growth.data))

Best, Jannik

brandmaier commented 3 months ago

Thanks. For now, I will add a type check whether data is really a data.frame. I am not sure whether supporting tibbles is possible at all but I will keep it on my list.