Simulation to understand the effectiveness of latent class estimation

travis-leith commented 9 months ago

I have need of identifying latent classes in some data. I have come up with a simulation of a very simplified version of my suspected data generating process.

library(tidyverse)
library(lme4)
library(lcmm)

rm(list = ls())

set.seed(1)

n_subjects <- 200
n_times <- 5
n_obs <- n_subjects * n_times

sd_t <- 1
sd_obs <- 0.1

group_beta <- c(0.3, 1.5, 2.7)
n_groups <- length(group_beta)

subject_groups <-
  tibble(
    subject = 1:n_subjects,
    g = sample(n_groups, n_subjects, replace = TRUE)
  ) |> mutate(
    group_beta = group_beta[g]
  )

epsilon_t <- rnorm(n = n_times, sd = sd_t)

simulated_data <-
  crossing(subject = 1:n_subjects, t = 1:n_times) |>
  inner_join(subject_groups, by = "subject") |>
  mutate(
    epsilon_t = epsilon_t[t],
    epsilon_obs = rnorm(n = n(), sd = sd_obs),
    x = rnorm(n = n()),
    y = group_beta * x + epsilon_t + epsilon_obs
  )

#fit using lmer with known groups
m_lmer <- lmer(y ~ x + (x - 1 | g) + (1 | t), data = simulated_data)

#fit using lcmm with unknown groups
m_lcmm_b <- hlme(y ~ x, subject = "t", random = ~ 1, data = simulated_data, verbose = TRUE)
m_lcmm <- hlme(y ~ x, subject = "t", random = ~ 1, mixture = ~ x - 1, ng = 3, data = simulated_data, verbose = TRUE, B = m_lcmm_b)

#compare epsilon_t estimates
tibble(
  actual = epsilon_t,
  lmer = ranef(m_lmer)$t$`(Intercept)`,
  lcmm = ranef(m_lcmm)$intercept
)

#compare group_beta estimates
tibble(
  actual = group_beta,
  lmer = ranef(m_lmer)$g$x + fixef(m_lmer)["x"],
  lcmm = fixef(m_lcmm)[[2]][-1]
)

To explain my data, I have some coefficients which I suspect might differ by some latent grouping, and I have some common effect due to measuring different subjects at the same time.

After I simulate the data, I estimate the effects with known groups using lmer, and I do the same with unknown groups using hlme.

lmer is able to estimate the unknown group coefficients quite well, which is not surprising since it knows the groups in the first place.

However, the group coefficients (I realise I am displaying them arbitrarily since the ordering is abritrary) do not look much like the actuals.

So, my question / issue is: am I using hlme correctly? Is this a realistic test of this method?

travis-leith commented 9 months ago

The problem seems to be the fact that the random intercept is grouped by t, but that is the only thing that should be grouped by t. If I remove the random intercept component, and set subject = "subject", then it gets the right answer. Other than identifying the grouping to use for the random intercept, I am not sure what effect the parameter subject actually has. Is this a bug, or is this a user error?

CecileProust-Lima commented 9 months ago

Hi, the "subject" identifies the unique identifier unit. I assume this is the subject for you. The time is an explanatory variable and should be included in the right hand side of the formulas. In addition, I don't understand the (x - 1 | g) in the lmer. I would rather assume : x*g to have an effect of the covariate specific to each group g level. Think of the group g variable as any other categorical variable. The only difference with latent class models is that is it unknown, and so it needs to be retrieved from the data. But the idea is really the same as having a categorical variable in your model. Cécile

travis-leith commented 9 months ago

Hi @CecileProust-Lima, the (x-1|g) means there is a random slope assigned to each g. The -1, just like the convention in lm, means there is no intercept. If (x|g) is used instead, this would specify a random slope and intercept for each g.

Now that you mention it, an interaction effect makes sense in this simulated data. So now the model is

m_lmer <- lmer(y ~ x:g - 1 + (1 | t), data = simulated_data |> mutate(g = as.character(g)))

And this gives almost identical results in this case, although in a case with less clear separation, the random slope version would probably exhibit more shrinkage.

For the avoidance of doubt, the model being simulated is

$y_i = \beta_g x_i + \epsilon_t + \epsilon_i$, where $S(i)=s$ and $G(s)=g$.

By this I mean that the $i$ th observation, which is made of subject $s$, which belongs to latent group $g$ is a linear function of covariate $x$ and some randomness associated with time $t$ and some randomness associated with observation $i$.

If we assume that we do not know the mapping $G(s)$, can we discover it using lcmm, and still model $\epsilon_t$ at the same time?

CecileProust-Lima / lcmm

Simulation to understand the effectiveness of latent class estimation #237