FloSchuberth / cSEM

Composite-based SEM
https://floschuberth.github.io/cSEM/
GNU General Public License v3.0

Repeated indicators break calculation of GoF and other indices #466

Closed · ihrke closed this issue 2 years ago

ihrke commented 2 years ago

Using the exact example taken from the csem() documentation, it is not possible to calculate the GoF and other indices using the assess() function. The most frequently encountered error is

Error in solve.default(C[followers, followers, drop = FALSE]) : 
  system is computationally singular: reciprocal condition number = 3.04917e-17

I assume it is because of the multicollinearity in the design matrix induced by adding renamed versions of the repeated indicator variables (the *_temp variables). Is there a way to fix that?
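For intuition, a quick sketch (not cSEM-specific) of why duplicated columns produce exactly this kind of error: duplicating the columns of a data matrix makes its correlation matrix exactly singular, which is what solve() complains about.

set.seed(1)
X     <- matrix(rnorm(300), ncol = 3)
X_dup <- cbind(X, X)   # crude stand-in for repeated indicators
S     <- cor(X_dup)
rcond(S)               # reciprocal condition number effectively zero
try(solve(S))          # "system is computationally singular"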

Here is code to produce the error (copy and pasted from the example section of csem()):

library(cSEM)
# The standard repeated indicators approach is done by 1.) respecifying the model
# and 2.) adding the repeated indicators to the data set

# 1.) Respecify the model
model_RI <- "
# Path model / Regressions 
c4   ~ eta1
eta2 ~ eta1 + c4
c4   ~ c1 + c2 + c3

# Reflective measurement model
c1   <~ y11 + y12 
c2   <~ y21 + y22 + y23 + y24
c3   <~ y31 + y32 + y33 + y34 + y35 + y36 + y37 + y38
eta1 =~ y41 + y42 + y43
eta2 =~ y51 + y52 + y53

# c4 is a common factor measured by composites
c4 =~ y11_temp + y12_temp + y21_temp + y22_temp + y23_temp + y24_temp +
      y31_temp + y32_temp + y33_temp + y34_temp + y35_temp + y36_temp + 
      y37_temp + y38_temp
"

# 2.) Update data set
data_RI <- dgp_2ndorder_cf_of_c
coln <- c(colnames(data_RI), paste0(colnames(data_RI), "_temp"))
data_RI <- data_RI[, c(1:ncol(data_RI), 1:ncol(data_RI))]
colnames(data_RI) <- coln

# Estimate
res_RI <- csem(data_RI, model_RI)
summarize(res_RI)
assess(res_RI) ## <- this line generates the error

FloSchuberth commented 2 years ago

Dear @ihrke,

first of all, thank you very much for using cSEM :)

When we implemented the approaches to deal with second-order constructs, we spent a lot of time thinking about implementing the repeated indicators approach. In fact, in some earlier versions of cSEM the repeated indicators approach was implemented. However, it comes along with several problems.

The biggest problem is that the model you actually want to estimate is modified by the repeated indicators. As a consequence, the indicators' model-implied variance-covariance matrix of the repeated indicators model differs from the variance-covariance matrix implied by the model you actually want to estimate. This matrix is used for the calculation of various metrics such as the logL, the SRMR, etc. To overcome this problem, one must remove the rows/columns belonging to the repeated indicators from that matrix. Moreover, since the indicators are already used elsewhere in the model, you often face problems of perfect multicollinearity.

Against this background, we have stopped pursuing the implementation of the repeated indicators approach (in fact, the current version of cSEM does not contain it anymore) and recommend using the two/three-stage approach:

model_ts <- "
# Path model / Regressions
c4   ~ eta1
eta2 ~ eta1 + c4
c4   <~ c1 + c2 + c3

# Reflective measurement model
c1   <~ y11 + y12
c2   <~ y21 + y22 + y23 + y24
c3   <~ y31 + y32 + y33 + y34 + y35 + y36 + y37 + y38
eta1 =~ y41 + y42 + y43
eta2 =~ y51 + y52 + y53
"

res_ts <- csem(data_RI, model_ts, .disattenuate = TRUE)
verify(res_ts)

For the two/three-stage approach, the assess() function works properly; see assess(res_ts).

The benefits of a proper implementation, i.e., automatically adding the repeated indicators, removing them afterwards, and adjusting all the outputs, are from our perspective not worth the effort.
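To give an impression of what such an adjustment involves, here is a minimal sketch of the first step only. It assumes cSEM's fit() returns the model-implied indicator VCV as a named matrix; every fit statistic (and its degrees of freedom) would still have to be recomputed from the trimmed matrix by hand.

# Sketch: drop the repeated indicators from the model-implied indicator VCV.
Sigma_hat <- fit(res_RI)   # assumption: model-implied indicator VCV
keep      <- !grepl("_temp$", rownames(Sigma_hat))
Sigma_hat_trimmed <- Sigma_hat[keep, keep, drop = FALSE]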

Your model also shows some problems. For example, you specify c4 as a common factor (=~), and therefore csem by default applies a correction for attenuation (PLSc). However, you specified the first-order constructs as composites. The question arises: are the indicators measurement-error-prone manifestations or not? Currently, you treat them as measurement-error-free variables for your first-order constructs, but as measurement-error-prone variables for the second-order construct. Obviously, this is a contradiction.

Moreover, the construct variance-covariance matrix is not positive semi-definite, see:

verify(res_RI)


Verify admissibility:

 inadmissible

Details:

  Code   Status   Description
  1      ok       Convergence achieved
  2      ok       All absolute standardized loading estimates <= 1
  3      not ok   Construct VCV is positive semi-definite
  4      ok       All reliability estimates <= 1
  5      not ok   Model-implied indicator VCV is positive semi-definite


As a side effect the R^2 value of c4 is larger than 1:

res_RI$Estimates$R2
       c4      eta2 
1.1789707 0.2761084 

This also creates problems because R^2 values are expected to be between 0 and 1.
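For intuition, the information criteria involve terms along the lines of log(1 - R^2); a sketch of the failure mode (not cSEM's exact formula):

r2 <- 1.1789707   # the inadmissible R^2 of c4 from above
log(1 - r2)       # NaN with a warning, so any criterion built on it breaks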

In general, the assess function calls various "calculate" functions to obtain certain statistics. By default, the argument .quality_criterion in the assess function is set to "all" and therefore "everything" is calculated. However, some functions simply break, e.g., because the R^2 is above 1 or because of issues with the model-implied variance-covariance matrix of the indicators. See the following:

assess(.object = res_RI, .quality_criterion = "aic")            # error
assess(.object = res_RI, .quality_criterion = "aicc")           # error
assess(.object = res_RI, .quality_criterion = "aicu")           # error
assess(.object = res_RI, .quality_criterion = "bic")            # error
assess(.object = res_RI, .quality_criterion = "fpe")            # error
assess(.object = res_RI, .quality_criterion = "gm")             # error
assess(.object = res_RI, .quality_criterion = "hq")             # error
assess(.object = res_RI, .quality_criterion = "hqc")            # error
assess(.object = res_RI, .quality_criterion = "mallows_cp")     # error
assess(.object = res_RI, .quality_criterion = "ave")            # no error
assess(.object = res_RI, .quality_criterion = c("rho_C", "rho_C_mm", "rho_C_weighted", "rho_C_weighted_mm")) # no error
assess(.object = res_RI, .quality_criterion = "effects")        # no error
assess(.object = res_RI, .quality_criterion = "f2")             # no error
assess(.object = res_RI, .quality_criterion = "fl_criterion")   # no error
assess(.object = res_RI, .quality_criterion = "chi_square")     # no error
assess(.object = res_RI, .quality_criterion = "chi_square_df")  # no error
assess(.object = res_RI, .quality_criterion = "cfi")            # no error
assess(.object = res_RI, .quality_criterion = "cn")             # no error
assess(.object = res_RI, .quality_criterion = "gfi")            # no error
assess(.object = res_RI, .quality_criterion = "ifi")            # no error
assess(.object = res_RI, .quality_criterion = "nfi")            # no error
assess(.object = res_RI, .quality_criterion = "nnfi")           # no error
assess(.object = res_RI, .quality_criterion = "reliability")    # no error
assess(.object = res_RI, .quality_criterion = "rmsea")          # no error
assess(.object = res_RI, .quality_criterion = "rms_theta")      # no error
assess(.object = res_RI, .quality_criterion = "srmr")           # no error
assess(.object = res_RI, .quality_criterion = "gof")            # not printed, no error
assess(.object = res_RI, .quality_criterion = "htmt")           # no error
assess(.object = res_RI, .quality_criterion = "htmt2")          # no error
assess(.object = res_RI, .quality_criterion = "r2")             # no error
assess(.object = res_RI, .quality_criterion = "r2_adj")         # no error
assess(.object = res_RI, .quality_criterion = "rho_T")          # no error
assess(.object = res_RI, .quality_criterion = "rho_T_weighted") # no error
assess(.object = res_RI, .quality_criterion = "vif")            # no error
assess(.object = res_RI, .quality_criterion = "vifmodeB")       # no error

Of course, even though most of the functions work technically, you should interpret the results very cautiously, because you evaluate the model containing the repeated indicators and not the model you likely have in mind.

HTH

Best regards, Florian

P.S. Currently, some statistics simply cannot be calculated for the repeated indicators model.

ihrke commented 2 years ago

Thank you for the feedback and the hard work you put into developing the package - it is highly appreciated! The model I posted the code for was not my own model but just an example that I copied and pasted from the documentation (on the help(csem) page), so I did not check it thoroughly.

The reason I was trying to implement the repeated indicators approach in the first place is that I got inadmissibility problems with the 2-stage approach when trying to fit my own target model:

Verify admissibility:

     inadmissible

Details:

  Code   Stage 1   Stage 2   Description
  1      ok        ok        Convergence achieved                                   
  2      ok        not ok    All absolute standardized loading estimates <= 1       
  3      ok        ok        Construct VCV is positive semi-definite                
  4      ok        not ok    All reliability estimates <= 1                         
  5      ok        ok        Model-implied indicator VCV is positive semi-definite  

This issue was solved when using repeated indicators (i.e., assess(res) gave no problems) but then I couldn't use all the diagnostic functions. To fix this, I wrote some code to remove the repeated indicators from the fitted model object:

library(stringr)  # needed for str_ends()

sniptemp <- function(l) {
  lclass <- class(l)
  ll <- lapply(l, function(x) {
    if (is.list(x)) {
      return(sniptemp(x))
    }
    # snip away the _temp variables
    if (is.matrix(x)) {
      rix <- str_ends(rownames(x), "_temp", negate = TRUE)
      cix <- str_ends(colnames(x), "_temp", negate = TRUE)
      x <- x[rix, cix, drop = FALSE]
    } else if (is.character(x)) {
      ix <- str_ends(x, "_temp", negate = TRUE)
      x <- x[ix]
    } else if (is.numeric(x)) {
      ix <- str_ends(names(x), "_temp", negate = TRUE)
      x <- x[ix]
    }
    return(x)
  })
  class(ll) <- lclass
  return(ll)
}
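Hypothetical usage (with res being a cSEMResults object returned by csem()):

res_snipped <- sniptemp(res)   # strip all "_temp" entries from the object
assess(res_snipped)            # then retry the diagnostics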

However, when doing this, I got a different set of problems:

Error in solve.default(diag(sqrt(var_proxies), nrow = length(var_proxies),  : 
  Lapack routine dgesv: system is exactly singular: U[1,1] = 0

So at the moment I have no working version.

Best,

Matthias

FloSchuberth commented 2 years ago

Dear @ihrke ,

what type of second-order construct are you studying? A common factor of composites or a common factor of common factors?

I am not aware of any literature that has shown that the repeated indicators approach produces consistent estimates for models containing second-order common factors, even if you apply a correction for attenuation (as done in PLSc). Be aware that by default cSEM applies PLSc and thus corrects for attenuation bias. As said, whether this works in the case of the repeated indicators approach is still a matter for future research. Don't be confused by PLS literature that refers to the repeated indicators approach and mentions its suitability for second-order constructs. This literature typically relies on PLS terminology, i.e., reflective simply means using Mode A to calculate the PLS weights.

Considering your function: do you overwrite parts of the csem object, or do you just apply it to the csem output? If you replace parts of the csem object, this can create problems. For instance, for some assess functions the model needs to be re-estimated. If you, for example, replace the correlation matrix containing the repeated indicators but not the model syntax, then you will get an error because the two no longer match. Perhaps you can post a working example of your function.

Best regards, Florian

ihrke commented 2 years ago

Dear @FloSchuberth,

thanks a lot for your detailed input, it is really helpful - in particular because I am a newcomer to this method. My target model looks like this:

model="
# Structural model
BrainVol ~  Trauma + HADS  + Age + Sex + Cognitive
Trauma ~ Sex + HADS

BrainVol =~ Hippocampus + Amygdala + Thalamus
HADS =~ Depression + Anxiety

# Formative model for LV 'Trauma'
Trauma <~ illness_accident + violence + sexual_abuse + bullied + witnessed + frightening + death_close + painful_hospital + close_illness

# Reflective measurement model for the brain areas and cognition
Hippocampus =~ mr_lh_hippocampus + mr_rh_hippocampus
Thalamus =~ mr_lh_thalamus + mr_rh_thalamus
Amygdala =~ mr_lh_amygdala + mr_rh_amygdala

Cognitive =~ mmstotal_score + sum_recognition

# Reflective measurement model for Anxiety/Depression
Anxiety =~ hads_tense + hads_awful + hads_worry + hads_relaxed + hads_butterflies + hads_restless + hads_panic
Depression =~ hads_enjoy + hads_laugh + hads_cheerful + hads_slowed_down + hads_appearance + hads_look_forward + hads_enjoy_book_tv

# Single-item constructs to use in structural model
Age <~ age
Sex <~ sex
"

My goal is to relate Trauma (formative) and Depression to structural changes in different brain regions (here summarized through a second-order construct "BrainVol"). This model gives me inadmissibility warnings at stage 2, as detailed above (both "All absolute standardized loading estimates <= 1" and "All reliability estimates <= 1" fail). I tried both with and without the .disattenuate option, and it fails in both cases. Re-coding the same model with repeated indicators gets rid of these inadmissibilities.

Regarding my function sniptemp(res) from above: it simply removes variables whose names end in _temp from all matrices, character vectors, and numeric vectors by "walking" through the quite complex cSEMResults structure.

Thanks again for your help!

Best,

Matthias

FloSchuberth commented 2 years ago

Dear @ihrke,

considering your function, this will not work, as much more of the csem output must be modified, e.g., the degrees of freedom, the character strings that describe your original model, etc. Therefore, I think sticking to cSEM's original assess function will be a fruitless endeavor. What you could do is write your own functions, or copy cSEM's functions and adjust them to calculate the desired statistics. However, in my past experience, it is not always clear how certain statistics should be calculated in the case of the repeated indicators approach. That was also the reason why we did not pursue this approach.

Considering your model, even though it is not directly related to cSEM, I am not sure whether cSEM does what you have in mind. It seems that you assume Trauma is a latent variable which is caused by a set of variables. Be aware that the literature is not really clear about formative measurement. It can either mean that you have a latent variable which is causally affected by a set of variables (a causal-formative measurement model; Bollen & Bauldry, 2011), or it can mean that you have a composite that is composed of a set of variables (which I would not call measurement). For the former, you need at least two consequences to ensure that the model is identified. Currently, you only have BrainVol, which is not sufficient. In cSEM, if you use <~, a composite/emergent variable is used, for which the weights are calculated by Mode B. Consequently, a causal-formative measurement model is not estimated.

Your example reminds me a bit of an example I saw recently in Hwang et al. (2021), who employed composites. You might want to have a look. If you want to use an emergent variable/composite/component for Trauma, you are not limited to PLS(c); as shown by Hwang et al. (2021), you can also use IGSCA (I am not sure whether it currently allows for estimating models containing second-order composites; you might want to have a look at GSCA Pro, https://www.gscapro.com). Moreover, I recently presented a specification that allows commonly used SEM estimators to estimate models containing composites (Schuberth, in press). Hence, modeling second-order constructs shouldn't be a problem using that approach.

Considering your second-order construct, you are facing a common factor of common factors. For this case, I must admit, PLS is not the best estimator. At least, I have also experienced the problem of reliability estimates above one when I tested this approach. This approach is not published yet; we implemented it only because it was so straightforward, being very similar to an approach proposed by Van Riel et al. (2017). Be aware that the typical PLS literature does not care much about measurement errors (i.e., it does not correct for attenuation, but simply uses Mode A to calculate the weights used to compute construct scores); see Becker et al. (2012) and Sarstedt et al. (2019). So this literature can be very confusing, because it gives the impression that a reflective model is estimated while in fact a composite model is estimated. In contrast, in cSEM we apply a correction for attenuation when the =~ operator is used.
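To make that distinction concrete, a minimal sketch reusing the model and data from the top of this thread; .disattenuate is csem()'s documented switch, and turning it off yields the plain-PLS "Mode A" behavior described in that literature:

res_plsc  <- csem(data_RI, model_RI)                         # default: PLSc, corrects for attenuation
res_modeA <- csem(data_RI, model_RI, .disattenuate = FALSE)  # plain PLS: Mode A weights, no correction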

HTH

Best regards, Florian

References

J.-M. Becker, K. Klein, and M. Wetzels, “Hierarchical latent variable models in PLS-SEM: Guidelines for using reflective-formative type models,” Long Range Planning, vol. 45, no. 5, 2012.

K. A. Bollen and S. Bauldry, “Three Cs in measurement models: Causal indicators, composite indicators, and covariates,” Psychological Methods, vol. 16, no. 3, 2011.

H. Hwang et al., “An approach to structural equation modeling with both factors and components: Integrated generalized structured component analysis,” Psychological Methods, vol. 26, no. 3, 2021.

M. Sarstedt, J. F. Hair, J.-H. Cheah, J.-M. Becker, and C. M. Ringle, “How to specify, estimate, and validate higher-order constructs in PLS-SEM,” Australasian Marketing Journal (AMJ), vol. 27, no. 3, Aug. 2019, doi: 10.1016/j.ausmj.2019.05.003.

F. Schuberth, “The Henseler-Ogasawara specification of composites in structural equation modeling: A tutorial,” Psychological Methods, in press.

A. C. R. Van Riel, J. Henseler, I. Kemény, and Z. Sasovova, “Estimating hierarchical constructs using Partial Least Squares: The case of second order composites of factors,” Industrial Management & Data Systems, vol. 117, no. 3, 2017.