abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/

different results on Mac, Linux, Windows #529

Closed bdesmarais closed 1 year ago

bdesmarais commented 1 year ago

Overview

Thanks for the great package! My collaborator and I have replicated this problem across multiple machines. Running the following code gives different results on Mac (the best-fitting model has 34 variables) than on Linux and Windows machines (the best-fitting model has 32 variables). The data object is available at this link: https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/bbd5087_psu_edu/EU74wxpEuqtLiQ3BVw1-gt8B-gdIe7iAUZIxhPbWuW90_A?e=X8Q0Z9

Code for Reproduction

library(abess)

# read in the data
load("yx_home.RData")

# run abess/summarize
abess_res <- abess(yx[, -1], yx[, 1])
summary(extract(abess_res))
# support.vars is of length 32 on Linux/Windows, and 34 on Mac

Expected behavior

support.vars should have the same length on every platform. Instead, it has length 32 on Linux/Windows and 34 on Mac.

bbayukari commented 1 year ago

I'm very sorry for taking such a long time to reply to you. Your question did indeed trouble me for quite some time.

Let me start with the conclusion: there are no bugs in the program, and the differing results are within the normal range.

One way to improve the results is to specify an appropriate value for the lambda parameter of abess. Removing highly correlated variables can also help, particularly c500_c694, c640_c694, c220_c694, c750_c694, c92_c694, c235_c694, c811_c694, etc., each of which is non-zero in only one sample.
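A screening step along these lines could be sketched in base R as follows. This is only an illustration, not part of abess: the helper name screen_predictors, its thresholds, and the toy data are all assumptions, and it expects a numeric matrix with no missing values.

```r
# Hypothetical helper (not part of abess): drop predictors that are likely to
# cause (near-)perfect collinearity before the model is fitted.
screen_predictors <- function(x, min_nonzero = 2, max_abs_cor = 0.999) {
  x <- as.matrix(x)

  # 1. Drop columns with fewer than `min_nonzero` non-zero entries,
  #    and constant columns (their correlation is undefined).
  enough_nonzero <- colSums(x != 0) >= min_nonzero
  not_constant <- apply(x, 2, function(col) length(unique(col)) > 1)
  x <- x[, enough_nonzero & not_constant, drop = FALSE]

  # 2. Drop every column that is highly correlated with an earlier column.
  if (ncol(x) > 1) {
    cc <- abs(cor(x))
    cc[lower.tri(cc, diag = TRUE)] <- 0  # keep only pairs (i, j) with i < j
    redundant <- apply(cc, 2, function(col) any(col > max_abs_cor))
    x <- x[, !redundant, drop = FALSE]
  }
  x
}

# Toy data: `a_copy` is an exact multiple of `a`, and `spike` is non-zero
# in a single sample.
set.seed(1)
n <- 50
x <- cbind(a = rnorm(n), b = rnorm(n))
x <- cbind(x, a_copy = x[, "a"] * 2, spike = c(1, rep(0, n - 1)))

x_screened <- screen_predictors(x)
colnames(x_screened)
```

On the toy data only a and b should survive. The screened matrix could then be passed to abess in place of the original predictors, optionally together with a positive lambda as suggested above; suitable thresholds depend on the data.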

Below is the specific reason behind the differing results: When selecting s variables, abess requires that there isn't excessive multicollinearity among any 2s variables. For instance, if we have 1000 samples and need to select 50 out of 10000 variables, even though there will inevitably be multicollinearity issues involving not more than 1000 variables in the data, abess typically works correctly. However, if variable1 and variable2 have a (nearly) perfect correlation (a multicollinearity issue involving only 2 variables), the choice of variable1 over variable2 by abess will depend on the underlying numerical computation. It's worth noting that either choice is correct, and abess does not perform any additional checks in this case.

In fact, I observed that the line of code "Eigen::VectorXd beta_full = XTX.ldlt().solve(XTy);" yields different results on macOS and Windows. This happens because the matrix XTX is singular, so the solution is not unique. This is consistent with Eigen's documentation: "This method just tries to find as good a solution as possible." Since abess is an iterative algorithm, a difference at this step can carry through to the later iterations, so the final results may differ.
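The non-uniqueness can be illustrated with a small base R sketch. It is not the abess or Eigen code path, only a toy example with made-up data in which one predictor is an exact multiple of the other; the names XTX and XTy mirror the C++ line above, and everything else is hypothetical.

```r
# Two perfectly correlated predictors make t(X) %*% X singular.
set.seed(42)
n <- 20
x1 <- rnorm(n)
X <- cbind(x1 = x1, x2 = 2 * x1)   # x2 is an exact multiple of x1
y <- 3 * x1 + rnorm(n, sd = 0.1)

XTX <- crossprod(X)      # t(X) %*% X
XTy <- crossprod(X, y)   # t(X) %*% y

# Numerical rank of the 2 x 2 matrix XTX
qr(XTX)$rank

# Two different coefficient vectors, each using only one of the predictors
b1 <- sum(x1 * y) / sum(x1^2)
beta_a <- c(b1, 0)       # all weight on x1
beta_b <- c(0, b1 / 2)   # all weight on x2

# Both satisfy the normal equations XTX %*% beta = XTy ...
all.equal(as.vector(XTX %*% beta_a), as.vector(XTy))
all.equal(as.vector(XTX %*% beta_b), as.vector(XTy))

# ... and both give the same fitted values, so neither is "more correct".
all.equal(as.vector(X %*% beta_a), as.vector(X %*% beta_b))
```

Because every vector between beta_a and beta_b along the same line also solves the system, which one a solver returns can depend on low-level details such as the order of floating-point operations, which may vary across platforms.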

Finally, thank you for your feedback on the abess project. If you have any further questions or need to continue the discussion, please feel free to reach out. Your input is greatly appreciated.

bdesmarais commented 1 year ago

Thanks for looking into this. I appreciate the detailed response! In the pipeline we're developing we'll implement some additional screening to head this off before running ABESS.