chrisraynerr commented 1 year ago

Hello,

I have 5 depression questionnaire items from the SCL measured repeatedly in a sample of mothers during and after pregnancy (for multiple pregnancies). There are 61k participants. The median number of responses is 6 and the maximum is 16 time-points. The time-points are not at regular intervals - but are similar across participant (specific time-points during and after pregnancy). Is it possible to use the multlcmm function to estimate SCL IRT factor scores -- and produce a scores for every observation?

I want to take the factor scores and estimate a model that looks like this:

SclFactorScore ~ SNP + Age + CurrentlyPregnant + Postnatal + MedicationStatus + (Age | ID)

So I want to estimate mean effects and Age specific effects adjusting for repeated observations. I'm wondering, would I just estimate the IRT paramaters using multlcmm and then just re-weight the items? If I want to test for the effects of various covariates on the latent trait, should I also include them when estimating the IRT model?

Many thanks, Chris

VivianePhilipps commented 1 year ago

Hi Chris,

if I understood correctly, you want to assess the effect of 5 covariates (SNP, Age, CurrentlyPregnant, Postnatal, MedicationStatus) on depression and you have 5 items measuring depression. If so, you don't need to proceed in 2 steps, the multlcmm function does both in the same time. If you estimate the following model : m <- multlcmm(item1+item2+item3+item4+item5 ~ SNP + Age + CurrentlyPregnant + Postnatal + MedicationStatus, random=~Age, subject="ID", link="threholds")

a latent trait underlying the 5 items is defined, and the covariates effects are on this latent trait. You can find an example on our website (https://cecileproust-lima.github.io/lcmm/articles/latent_process_model_with_multlcmm_IRT.html) where we use 7 items of HADS to measure depression.

Best,

Viviane

chrisraynerr commented 1 year ago

Hi Viviane,

Thank you for your response. I have been working on this problem for some time and I am still quite stuck.

My aim is to run a repeated measures GWAS - to estimate genetic effects (SNP effects) on depression at age 30, and the interaction between each SNP and Age (difference from 30).

Ideal model: lme4::lmer(Dep ~ SNP + SNPAge + Age + Cov + (1|ID), data=Df) [too slow to run for millions of SNPs] Estimates: B0SNP, B1SNPAge

This requires running millions of models, which is very slow. Therefore one approach used to speed up computation is to estimate a baseline mixed model, then use the random effects parameters as Empirical Bayesian Estimators to estimate naive-empirical bayesian estimates. So,

(1) Baseline model: BaseModel <- lme4::lmer(Dep ~ Age + Cov + (Age|ID), data=Df) (2) Extract EBEs: Y0 <- ranef(BaseModel)$ID[["Intercept"]]; Y1 <- ranef(BaseModel)$ID[["Age"]] (3) Estimate N-EBEs: B0 <- lm(Y0 ~ SNP); B1 <- lm(Y1 ~ SNP)

However, it has been shown that N-EBEs are shrunk, and to obtain unbiased estimates, a correction should be applied (https://academic.oup.com/bib/article/22/3/bbaa130/5868073). This correction requires 3 things:

(1) the number of observations per participant: Ni (2) the variance covariance matrix of the random effects from the base model: Rhat <- matrix(as.numeric(lme4::VarCorr(BaseModel)[1:2,1:2]),ncol=2)

     0.4059    -0.0098
   -0.0098   6.0257

(3) the residual standard deviation from the base model: Gi <- diag(lme4::sigma(BaseModel)^2, Ni)

   sigma(BaseModel)
   [1]  0.7199

The estimates detailed above come from the BaseModel where a scaled sum-score is used to measure Depression symptoms. Now, I have been exploring the idea of using multlcmm as my BaseModel to obtain a more reliable latent depression symptom score.

(1) Baseline model: BaseModel <- multlcmm(DepI1+DepI2+DepI3+DepI4 ~ Age+Cov, random=~1+Age, data=Df, subject="ID", link="threholds", methInteg="QMC", nMC=1000, nproc=nCores) (2) Extract EBEs: Y0 <- BaseModel$predRE[,2]; Y1 <- BaseModel$predRE[,3] (3) Estimate N-EBEs: B0 <- lm(Y0 ~ SNP); B1 <- lm(Y1 ~ SNP)

However, when I come to correct the N-EBEs, I am getting very strange results (p-values do not follow a normal distribution). I think it is because I am using the wrong value for the residual standard deviation. I am defining:

(2) the variance covariance matrix of the random effects from the base model: Rhat <- matrix(c(1, VarCovRE(BaseModel)[1,1], VarCovRE(BaseModel)[1,1], VarCovRE(BaseModel)[2,1]),ncol=2)

   1.0000    -0.51428
   -0.51428  18.83844

   (This matches the variance covariance matrix from summary(BaseModel))

(3) the residual standard deviation (variance) from the base model:

get model residuals and estimate the variance

  Res <- BaseModel$pred[,c(1,2,6)] %>% filter(Yname==unique(.$Yname)[1]) %>% select(resid_ss)
  Gi   <- diag(var(Res), Ni)

  Here, sd(Res) = 0.0832; var(Res) = 0.0069 ...

  When I look at summary(BaseMod)
                                              DepI1      DepI2    DepI3     DepI4
  Residual standard error:   0.9198   0.9968   1.1602    1.1969

So I expect I'm going wrong here. Do you have any thoughts? I'd be happy to share more of the model outputs or the current draft of the paper if you need more details -- and if you think it's an interesting problem/question and you'd be interested in collaborating on the project then we'd certainly benefit from your expertise.

Many thanks in advance! Chris

VivianePhilipps commented 1 year ago

Hi,

I'm really sorry, the residuals should not appear with thresholds link. Indeed, with ordinal data the computation is not the same as for continuous outcomes. I will remove it quickly, the results you get are false (only the predictions in $pred and $predRE. The estimations are fine).

The equivalent of lme4::sigma(BaseModel) are the residual standard errors from the summary. But here you have 4 errors because you model 4 outcomes. I don't know if your correction method can be extended to consider multiple outcomes.

Best,

Viviane

chrisraynerr commented 1 year ago

Hi Viviane, thanks again for your response - very much appreciated!

Just so that I am clear - are you saying that the predicted outputs in $pred and $predRE are false? And that we can't estimate the random intercept and slope parameters (of the latent trait) from the multivariate model?

So, perhaps my only option is to use multlcmm to estimate the discrimination and threshold parameters, then create factor scores using mirtCAT::generate.mirt_object() and mirt::fscores()... then I can use the factor scores in lme4::lmer to estimate the random effects parameters and the correction parameters?

If Age is included as my time variable in the multlcmm model, can it also be included as the time variable in the lme4::lmer factor score model?

Many thanks! Chris

VivianePhilipps commented 1 year ago

Hi Chris,

the outputs in $pred and $predRE are wrong. The last commit in the master branch puts NA instead of these wrong values. The estimated model is correct, even with random effects, but the predictions are not.

I don't know the mirt package, but if it provides subject-specific scores, then this is a solution. In your second step with lmer, you can include Age to take into account the longitudinal evolution, even if the first step already includes Age.

Best,

Viviane

CecileProust-Lima / lcmm

IRT factor scores #176

get model residuals and estimate the variance