Rjacobucci / regsem

Regularized Structural Equation Modeling

fit_indices.R: CV uses test/hold-out data in the training sample? #2

Closed · jake-westfall closed this issue 7 years ago

jake-westfall commented 7 years ago

I was looking over the code in fit_indices.R to see how exactly you carry out cross-validation of SEMs, since no details are really given in the paper. If I'm reading the code correctly, there are two ways of computing cross-validated fit indices, selected by setting CV = TRUE or CV = "boot". But in both cases the cross-validation does not appear to be done properly, because both compare the implied covariance matrix from the full model (i.e., the model fitted to the entire dataset) to a sample covariance matrix from a subset of that same data. The problem is that the training and test sets then overlap: the subset/hold-out data is also part of the training set.

In the case of CV = TRUE, the sample covariance matrix is supplied by the user. So it is at least possible in principle to do the CV correctly (i.e., with non-overlapping sets), provided that the user (1) splits the data beforehand, (2) computes the sample covariance matrix in the test set, (3) fits the model to the training set only, and (4) calls fit_indices() on the fitted model with CV = TRUE, supplying the covariance matrix from step (2); a sketch of this procedure is below. In practice, however, it seems unlikely that users will do this. More likely, a user will fit the model to the full dataset and then call fit_indices() on that model with CV = TRUE, passing in a sample covariance matrix for a subset of the same data, which reintroduces the overlap problem. For all I know, this is what you did in the paper (there are not enough details to tell).
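
For concreteness, here is a minimal sketch of steps (1)-(4), using lavaan's built-in HolzingerSwineford1939 data and a hypothetical one-factor model. The exact name of fit_indices()'s covariance-matrix argument (written covMat later in this thread) may differ across versions, and pars_pen = "loadings" is the string shorthand accepted by newer regsem versions:

```r
library(lavaan)
library(regsem)

set.seed(1)
dat <- HolzingerSwineford1939[, paste0("x", 1:6)]  # lavaan's built-in data

# (1) split the data beforehand
train.idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[train.idx, ]
test  <- dat[-train.idx, ]

# (2) compute the sample covariance matrix in the test set
test.cov <- cov(test)

# (3) fit the model to the training set only
mod <- 'f1 =~ x1 + x2 + x3 + x4 + x5 + x6'  # hypothetical model
lav.fit <- cfa(mod, data = train)
reg.fit <- regsem(lav.fit, lambda = 0.05, type = "lasso",
                  pars_pen = "loadings")

# (4) evaluate the training-set fit against the holdout covariance matrix
fit_indices(reg.fit, CV = TRUE, CovMat = test.cov)  # argument name may vary
```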

In the case of CV = "boot", there is no way for the CV to be done correctly: the code forms the sample covariance matrices by taking random subsets of the very dataset used to fit the full model, so the training and test sets always overlap. This is not proper cross-validation. For contrast, a resampling scheme that avoids the overlap is sketched below.
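
One overlap-free alternative, sketched here under my own assumptions (reusing dat and mod from the sketch above), is to refit the model on each bootstrap sample and evaluate it only on the out-of-bag rows; the Frobenius-norm discrepancy is just an illustrative summary, not what fit_indices() computes:

```r
# Out-of-bag evaluation: refit per resample, test on rows left out of it.
oob_discrepancy <- replicate(50, {
  b     <- sample(nrow(dat), replace = TRUE)       # bootstrap training rows
  oob   <- setdiff(seq_len(nrow(dat)), unique(b))  # held-out (out-of-bag) rows
  fit.b <- cfa(mod, data = dat[b, ])               # refit on the resample only
  implied <- fitted(fit.b)$cov                     # model-implied covariance
  norm(implied - cov(dat[oob, ]), type = "F")      # discrepancy on held-out rows
})
mean(oob_discrepancy)
```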

Please let me know if I have misunderstood the code.

Rjacobucci commented 7 years ago

Hi Jake,

Yes, you're right that the bootstrapping is incorrect. That part of the code is a holdover from the package's beginnings and has been supplanted by what is included in cv_regsem, where the fit.ret2 argument lets users specify k-fold cross-validation or bootstrapping (see the sketch below). Thanks for pointing this out -- I'm just going to remove it.
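
For readers landing here, a hedged sketch of the cv_regsem route (the model, data, and tuning values are illustrative, and the fit.ret2 options are per my reading of the docs; see ?cv_regsem for the full argument list):

```r
library(lavaan)
library(regsem)

mod <- 'f1 =~ x1 + x2 + x3 + x4 + x5 + x6'  # hypothetical one-factor model
lav.fit <- cfa(mod, data = HolzingerSwineford1939[, paste0("x", 1:6)])

cv.out <- cv_regsem(lav.fit, type = "lasso", pars_pen = "loadings",
                    n.lambda = 20, jump = 0.05,
                    fit.ret  = c("rmsea", "BIC"),
                    fit.ret2 = "boot")  # resampling-based fit; "train" also available
cv.out$fits                             # fit measures at each penalty value
```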

And yes, you're correct that users could incorrectly have overlap between the data the model was run on and the holdout. I have a note about this in the covMat argument's documentation, but it is worth adding more.

jake-westfall commented 7 years ago

Cool. BTW, nice work on this topic. Regularized SEM was something I had been thinking about for a little while and was just about to start studying in more detail, and then your paper appeared, which suddenly made my project seem, shall we say, less urgent ;). Although I've recently left academia, so it doesn't matter too much anyway, I guess. Anyway, cheers.