Closed jake-westfall closed 7 years ago
Hi Jake,
Yes, you're right that the bootstrapping is incorrect. That part of the code is a holdover from an early version, supplanted by what is included in `cv_regsem`. Using `fit.ret2=`, users can specify either k-fold cross-validation or bootstrapping. Thanks for pointing this out -- I'm just going to remove it.
And yes, you're correct that users could incorrectly have overlap between the data the model was run on and the holdout set. I have a note about this in the documentation for the `covMat` argument, but it is worth adding more.
Cool. BTW, nice work on this topic. Regularized SEM was something I had been thinking about for a little while and was just about to start studying in more detail, and then your paper appeared, which suddenly made my project seem, shall we say, less urgent ;). Though I've recently left academia, so it doesn't matter too much anyway, I guess. Anyway, cheers.
I was looking over the code in fit_indices.R to see how exactly you carry out cross-validation of SEMs, since no details are really given in the paper. If I'm reading the code correctly, there are two methods of computing cross-validated fit indices, which we get by setting either `CV = TRUE` or `CV = "boot"`. But it looks like, either way, the cross-validation is not really done properly, because in both cases the implied covariance matrix from the full model (i.e., the model fitted to the entire dataset) is compared to a sample covariance matrix from a subset of the data. The problem here is that the training and test sets overlap: the subset/hold-out data is also part of the training set.

In the case of `CV = TRUE`, the sample covariance matrix is simply supplied by the user. So it is at least possible in principle for the CV to be done correctly (i.e., with non-overlapping sets), provided that the user (1) splits the data beforehand, (2) computes the sample covariance matrix in the test set, (3) fits the model to the training set, and (4) calls `fit_indices()` on the fitted model with `CV = TRUE`, supplying the covariance matrix from step (2). However, it seems unlikely in practice that users will do this. More likely, the user will fit the model to the full dataset and then call `fit_indices()` on that model with `CV = TRUE`, passing in a sample covariance matrix for a subset of the same data, leading to the overlap problem. For all I know, this is what you did in the paper (there are not enough details to tell).

In the case of `CV = "boot"`, there is no possibility that the CV is done correctly. You can see in the code that the sample covariance matrices are formed by taking random subsets of the dataset used to fit the full model, so there will always be overlap between the training and test sets. This is not proper cross-validation.

Please let me know if I have misunderstood the code.
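For concreteness, here is a minimal sketch (in Python/NumPy rather than R, and not the package's actual code) of the hold-out procedure described in steps (1)–(4): split the data first, fit on the training rows only, and evaluate the model-implied covariance against the sample covariance of the disjoint test rows. The `fit_model` callable is a hypothetical stand-in for whatever SEM fitting routine produces an implied covariance matrix.

```python
import numpy as np

def holdout_cov_discrepancy(data, fit_model, test_frac=0.3, seed=0):
    """Hold-out CV for a covariance-structure model (illustrative sketch).

    `fit_model` is a hypothetical callable: it takes a training data
    matrix (rows = observations) and returns the model-implied p x p
    covariance matrix. The point is that it never sees the test rows.
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape
    test_idx = rng.choice(n, size=int(n * test_frac), replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[test_idx] = True

    train, test = data[~mask], data[mask]       # disjoint train/test sets
    sigma_hat = fit_model(train)                # fit on training rows only
    s_test = np.cov(test, rowvar=False)         # sample cov of hold-out rows

    # Standard ML discrepancy used in SEM:
    #   F = log|Sigma| + tr(S Sigma^{-1}) - log|S| - p
    # which is >= 0, with equality iff S == Sigma.
    inv_sigma = np.linalg.inv(sigma_hat)
    return (np.log(np.linalg.det(sigma_hat))
            + np.trace(s_test @ inv_sigma)
            - np.log(np.linalg.det(s_test)) - p)
```

The contrast with the overlap problem above: an incorrect version would call `fit_model(data)` on the full matrix and then compare against `np.cov` of a subset of those same rows, so the test observations would also have influenced the fitted model.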