NIEHS / PrestoGP

Penalized Regression on Spatiotemporal Outcomes using Gaussian Processes a.k.a. PrestoGP
https://niehs.github.io/PrestoGP/

Multivariate estimate_betas method should fit a separate model for each outcome #33

Closed: ericbair-sciome closed this issue 3 months ago

ericbair-sciome commented 7 months ago

Currently, the multivariate estimate_betas method transforms the data and then fits a single lasso model to all outcomes simultaneously. The consensus was that it is probably better to fit a separate model to each outcome. It would be worthwhile to do some quick testing to confirm that the change doesn't meaningfully alter the results.
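
For concreteness, here is a minimal sketch of the two behaviors on synthetic data (plain glmnet calls, not the actual PrestoGP internals; `X_tilde`, `y_tilde`, and `outcome_id` are stand-ins for the iid-transformed design matrix, the stacked response, and per-row outcome labels):

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 10; k <- 2                      # rows per outcome, predictors, outcomes
X_tilde <- matrix(rnorm(n * k * p), n * k, p)  # stand-in for the transformed design
y_tilde <- drop(X_tilde %*% rnorm(p)) + rnorm(n * k)
outcome_id <- rep(seq_len(k), each = n)

# Current behavior: one lasso fit to all outcomes simultaneously
fit_joint <- cv.glmnet(X_tilde, y_tilde)

# Proposed behavior: a separate lasso fit for each outcome
fit_split <- lapply(seq_len(k), function(j) {
  rows <- outcome_id == j
  cv.glmnet(X_tilde[rows, , drop = FALSE], y_tilde[rows])
})
```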

ericbair-sciome commented 3 months ago

@kyle-messier you can correct me if I am wrong, but if I am not mistaken, the Vecchia-based iid transformation (using the transform_iid function) causes the y's to have mean 0. (I didn't verify that mathematically, but it seems to be the case in every example I have tried, including examples where the original y's have means nowhere close to 0.) Assuming that is the case, I think fitting separate glmnet models to each outcome does more harm than good. The main benefit I could see in fitting separate models is that a common model assumes the intercepts are the same for every outcome, which generally will not be the case. But if the transformation forces all the y's to have mean 0, that is a moot point. Fitting a separate model for each outcome means that each model is fit on only a fraction of the data, which increases the variance of the estimates for no benefit.
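
To illustrate the variance point with a toy simulation (ordinary least squares on synthetic data, nothing PrestoGP-specific): when two outcomes truly share one coefficient vector, fitting on one outcome's rows uses half the data and roughly doubles the squared estimation error.

```r
set.seed(2)
p <- 5; n <- 100; beta <- rnorm(p)
sim_once <- function() {
  X <- matrix(rnorm(2 * n * p), 2 * n, p)       # two outcomes' worth of rows
  y <- drop(X %*% beta) + rnorm(2 * n)
  b_joint <- coef(lm(y ~ X))[-1]                # fit on all rows
  b_split <- coef(lm(y[1:n] ~ X[1:n, ]))[-1]    # fit on one outcome's rows only
  c(joint = sum((b_joint - beta)^2), split = sum((b_split - beta)^2))
}
rowMeans(replicate(500, sim_once()))  # "split" error is roughly twice "joint"
```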

Based on the reasoning above, I am going to close this issue without implementing it. I have a new version of PrestoGP that fits separate models to each outcome, but as I alluded to above, it seems to do worse than the current version that fits one common model to all outcomes. Feel free to reopen if you disagree.

It's always a great feeling to spend a couple of weeks on something like this only to conclude that it was a waste of time. :) But now we know, I guess. More importantly, this means that I should hopefully be able to get the accessor functions and imputation implemented fairly quickly now. I did this first because I didn't want to have to redo the accessor/imputation functions (which obviously depend on the underlying glmnet model) later, but it looks like that will not be an issue.

kyle-messier commented 3 months ago

@ericbair-sciome If transform_iid is applied to the whole set of stacked Y's, standardizing them jointly to mean 0, then I can see how separating them out afterwards for regression could be bad. However, if transform_iid were applied to each outcome individually, each Y would still be standardized with respect to its original variable. Nonetheless, I am fine with punting this to a v2.0 release sometime in the unknown future.
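
A toy illustration of the distinction, using plain centering as a stand-in for the actual decorrelating transform (transform_iid does more than center, so this is only schematic):

```r
# Two outcomes with very different means
y1 <- rnorm(100, mean = 10)
y2 <- rnorm(100, mean = -10)

# Joint transform of the stacked vector: only the overall mean (~0) is
# removed, so each block keeps its own offset after splitting.
y_joint <- drop(scale(c(y1, y2), scale = FALSE))
tapply(y_joint, rep(1:2, each = 100), mean)   # ~ +10 and -10

# Per-outcome transform: each block is centered on its own mean.
y_sep <- c(drop(scale(y1, scale = FALSE)), drop(scale(y2, scale = FALSE)))
tapply(y_sep, rep(1:2, each = 100), mean)     # ~ 0 and 0
```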

ericbair-sciome commented 3 months ago

Recall that the multivariate Vecchia procedure combines the locs for all of the outcomes into one common locs matrix that is then ordered using max-min ordering. To transform the data to iid, we also combine the X matrices into a "super matrix." So I'm not sure how we would apply transform_iid to each set of X predictors one at a time, since the dimensions would not match. The only way to do it would be to fit a separate Vecchia approximation to each outcome, at which point we are basically fitting a series of univariate models rather than a multivariate model.
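
To make the dimension issue concrete, here is a rough sketch of the structure (my notation, with a block-diagonal construction assumed for illustration, not the actual PrestoGP internals):

```r
library(Matrix)

# The X's for the outcomes enter as blocks of one block-diagonal design
# whose row dimension matches the stacked locs matrix.
n1 <- 3; n2 <- 2; p <- 2
X1 <- matrix(rnorm(n1 * p), n1, p)
X2 <- matrix(rnorm(n2 * p), n2, p)
X_super <- bdiag(X1, X2)   # (n1 + n2) x (2 * p)
dim(X_super)

# The decorrelating matrix from the joint Vecchia approximation is
# (n1 + n2) x (n1 + n2): it can multiply X_super, but its dimensions do
# not match X1 or X2 alone, which is the mismatch described above.
```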

In any event, I definitely think we should table this until the paper is written. Maybe we can have a discussion at some point in the future about how we might implement this, or whether it even makes sense to implement. I have saved a copy of the code on my hard drive in case we do decide to revisit this some day. I can create a branch on GitHub as well if you want this code preserved for posterity. (It mostly works, but a couple of functions still need to be updated, and some of the tests are failing because the results are slightly different.)

kyle-messier commented 3 months ago

@ericbair-sciome Agreed - thanks Eric