aalfons / robustHD

Robust methods for high-dimensional data, in particular linear model selection techniques based on least angle regression and sparse regression.
GNU General Public License v3.0

Computational time issue for cross-validation with sparseLTS() #37

Closed valentint closed 11 months ago

valentint commented 2 years ago

I am experiencing speed issues when running cross-validation with sparseLTS(): about five minutes for a grid of ten values of lambda on a reasonably good laptop. My guess is that the problem is the RAM, which is "only" 8 GB.

Below are the script with the times I observed (approximate, measured with tictoc), my session info, and the system info.

library("robustHD") library(tictoc)

data("nci60")

y <- protein[, 92]
correlations <- apply(gene, 2, corHuber, y)
keep <- partialOrder(abs(correlations), 100, decreasing = TRUE)
X <- gene[, keep]

First I run sparseLTS() with the sequence of 10 lambdas, as in the example on GitHub; it takes approximately 5 minutes.

lambda <- seq(0.01, 0.5, length.out=10)
tic()
fit <- sparseLTS(X, y, lambda=lambda, mode = "fraction", crit = "PE",
                 splits = foldControl(K = 5, R = 1), seed = 20210507)
toc()

319.12 sec elapsed
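
If RAM really is the bottleneck, one quick sanity check (a sketch added for illustration, not part of the original timing; memory.size() is Windows-only and deprecated as of R 4.2, so it still works in the R 4.1.1 session below) is to reset the garbage-collector counters, rerun the fit, and inspect the peak memory use:

gc(reset = TRUE)   # reset the "max used" counters
fit <- sparseLTS(X, y, lambda = lambda, mode = "fraction", crit = "PE",
                 splits = foldControl(K = 5, R = 1), seed = 20210507)
gc()                      # "max used" column = peak R memory since the reset
memory.size(max = TRUE)   # Windows only: maximum memory (MB) used by this R session

If the peak stays well below 8 GB, RAM is unlikely to be the explanation.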

=====================================================================

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base

other attached packages:
[1] tictoc_1.0.1      robustHD_0.7.1    robustbase_0.93-9 perry_0.3.0      
[5] ggplot2_3.3.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       fansi_0.5.0      withr_2.4.2      crayon_1.4.1    
[5] utf8_1.2.2 MASS_7.3-54 grid_4.1.1 R6_2.5.1
[9] lifecycle_1.0.1 gtable_0.3.0 magrittr_2.0.1 scales_1.1.1
[13] pillar_1.6.3 rlang_0.4.11 vctrs_0.3.8 ellipsis_0.3.2
[17] tools_4.1.1 glue_1.4.2 DEoptimR_1.0-9 munsell_0.5.0
[21] compiler_4.1.1 pkgconfig_2.0.3 colorspace_2.0-2 tibble_3.1.5

========================================
System Info

Device name: DESKTOP-1NPTK8M
Processor: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz 2.81 GHz
Installed RAM: 8.00 GB
System type: 64-bit operating system, x64-based processor

Edition: Windows 10 Pro
Version: 21H1
Installed on: 10/31/2020
OS build: 19043.1237

aalfons commented 2 years ago

This is what I get on my Mac:

> library("robustHD")
Loading required package: ggplot2
Use suppressPackageStartupMessages() to eliminate package startup
messages
Loading required package: perry
Loading required package: parallel
Loading required package: robustbase
> library(tictoc)
> 
> data("nci60")
> 
> y <- protein[, 92]
> correlations <- apply(gene, 2, corHuber, y)
> keep <- partialOrder(abs(correlations), 100, decreasing = TRUE)
> X <- gene[, keep]
> 
> #First I run sparseLTS() with the sequence of 10 lamdas, as in the example on GitHub - it takes approx 5 minutes.
> lambda <- seq(0.01, 0.5, length.out=10)
> tic()
> fit <- sparseLTS(X, y, lambda=lambda, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507)
> toc()
24.97 sec elapsed

R session information:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] tictoc_1.0.1      robustHD_0.7.1    robustbase_0.93-8 perry_0.3.0      
[5] ggplot2_3.3.5    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       magrittr_2.0.1   MASS_7.3-54      tidyselect_1.1.1
 [5] munsell_0.5.0    colorspace_2.0-2 R6_2.5.1         rlang_0.4.11    
 [9] fansi_0.5.0      dplyr_1.0.7      tools_4.1.1      grid_4.1.1      
[13] gtable_0.3.0     utf8_1.2.2       withr_2.4.2      ellipsis_0.3.2  
[17] tibble_3.1.4     lifecycle_1.0.0  crayon_1.4.1     purrr_0.3.4     
[21] vctrs_0.3.8      glue_1.4.2       compiler_4.1.1   DEoptimR_1.0-9  
[25] pillar_1.6.2     generics_0.1.0   scales_1.1.1     pkgconfig_2.0.3 

Hardware Overview:

Model Name: MacBook Pro
Model Identifier: MacBookPro14,1
Processor Name: Dual-Core Intel Core i5
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB

System Software Overview:

System Version: macOS 11.6 (20G165)
Kernel Version: Darwin 20.6.0

aalfons commented 2 years ago

And this on my Windows machine:

> library("robustHD")
Loading required package: ggplot2
Loading required package: perry
Loading required package: parallel
Loading required package: robustbase
> library(tictoc)
> 
> data("nci60")
> 
> y <- protein[, 92]
> correlations <- apply(gene, 2, corHuber, y)
> keep <- partialOrder(abs(correlations), 100, decreasing = TRUE)
> X <- gene[, keep]
> 
> #First I run sparseLTS() with the sequence of 10 lamdas, as in the example on GitHub - it takes approx 5 minutes.
> lambda <- seq(0.01, 0.5, length.out=10)
> tic()
> fit <- sparseLTS(X, y, lambda=lambda, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507)
> toc()
54.48 sec elapsed

R session information:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Netherlands.1252  LC_CTYPE=English_Netherlands.1252   
[3] LC_MONETARY=English_Netherlands.1252 LC_NUMERIC=C                        
[5] LC_TIME=English_Netherlands.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tictoc_1.0.1      robustHD_0.7.1    robustbase_0.93-8 perry_0.3.0       ggplot2_3.3.5    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       magrittr_2.0.1   MASS_7.3-54      tidyselect_1.1.1 munsell_0.5.0   
 [6] colorspace_2.0-2 R6_2.5.1         rlang_0.4.11     fansi_0.5.0      dplyr_1.0.7     
[11] tools_4.1.1      grid_4.1.1       gtable_0.3.0     utf8_1.2.2       DBI_1.1.1       
[16] withr_2.4.2      ellipsis_0.3.2   assertthat_0.2.1 tibble_3.1.4     lifecycle_1.0.0 
[21] crayon_1.4.1     purrr_0.3.4      vctrs_0.3.8      glue_1.4.2       DEoptimR_1.0-9  
[26] compiler_4.1.1   pillar_1.6.2     generics_0.1.0   scales_1.1.1     pkgconfig_2.0.3

Hardware information:

Processor: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz 2.50 GHz
Installed RAM: 16,0 GB (15,9 GB usable)
System type: 64-bit operating system, x64-based processor

System software information:

Edition: Windows 10 Home
Version: 20H2
Installed on: 15/05/2021
OS build: 19042.1237
Experience: Windows Feature Experience Pack 120.2212.3530.0

aalfons commented 2 years ago

So the processor on my machines should be slower (i5 vs i7, lower frequency), but they have twice as much RAM. Not sure if the RAM makes the difference, but I'll see if I can get a hold of a laptop with 8GB RAM to reproduce the issue.

Another possibility would be BLAS/LAPACK, but I don't have any special BLAS/LAPACK libraries on my machines, just the default ones.
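
For reference, a quick way to check which BLAS/LAPACK an R installation actually links against (base R helpers; exact availability depends on the R version, but both sessions above run R 4.1.1):

sessionInfo()              # recent R versions also list the BLAS/LAPACK libraries
extSoftVersion()["BLAS"]   # path of the BLAS library in use (may be empty for the bundled one)
La_library()               # path of the LAPACK library in use
La_version()               # LAPACK version

If both machines report the reference BLAS/LAPACK shipped with R, this possibility can probably be ruled out.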

valentint commented 2 years ago

I have seen cases where the speed depends dramatically on the available physical RAM, but only with very large data sets. Here the data set is only 59x100 after the gene selection.

I tried the example from the sparseLTS help page, set p to 100, increased the grid for lambda to seq(0.1, 0.5, length.out=10), and ran it twice: once with the default crit = "BIC" and once with crit = "PE".

Below is my script with the timings.

As with the nci60 example, the time grows a lot (about five times) with crit = "PE". Is it possible that the resampling for computing the prediction error is memory-hungry?

install.packages("microbenchmark")

library(microbenchmark)
library(robustHD)

## generate data
## the example is not high-dimensional to keep computation time low
library("mvtnorm")
set.seed(1234)    # for reproducibility
n <- 100          # number of observations
p <- 100          # number of variables
beta <- rep.int(c(1, 0), c(5, p-5))  # coefficients
sigma <- 0.5      # controls signal-to-noise ratio
epsilon <- 0.1    # contamination level
Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p))
x <- rmvnorm(n, sigma=Sigma)    # predictor matrix
e <- rnorm(n)                   # error terms
i <- 1:ceiling(epsilon*n)       # observations to be contaminated
e[i] <- e[i] + 5                # vertical outliers
y <- c(x %*% beta + sigma * e)  # response
x[i,] <- x[i,] + 5              # bad leverage points

frac <- seq(0.1, 0.5, length.out=10)
mb <- microbenchmark(single=sparseLTS(x, y, lambda = 0.05, mode = "fraction"),
                     BIC=sparseLTS(x, y, lambda = frac, mode = "fraction"),
                     PE=sparseLTS(x, y, lambda = frac, mode = "fraction", crit = "PE"),
                     times=5)
print(mb, signif=3)

Unit: seconds
   expr   min    lq  mean median    uq   max neval
 single  3.80  4.06  4.06   4.09  4.15  4.22     5
    BIC 13.40 13.60 13.60  13.60 13.70 13.70     5
     PE 64.90 65.30 65.80  65.80 66.30 66.40     5
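
As an aside, if the crit = "PE" runs remain this slow, parallelizing the prediction-error estimation might help. The sketch below assumes that the ncores/cl arguments of sparseLTS() apply to the cross-validation step; please check ?sparseLTS before relying on this.

library(parallel)
cl <- makeCluster(2)   # adjust to the number of available cores
fit_pe <- sparseLTS(x, y, lambda = frac, mode = "fraction", crit = "PE",
                    cl = cl, seed = 20210507)
stopCluster(cl)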

aalfons commented 2 years ago

It's not surprising that the computation time grows by a factor of about 5 with crit = "PE" compared to crit = "BIC". By default, the former uses 5-fold cross-validation to estimate the prediction error, so the method is fitted to five different training sets, each containing 80% of the observations, followed by a final run on the full data set with the optimal value of the penalty parameter.
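
For concreteness, a back-of-the-envelope count of the model fits implied by this setup (an illustration of the reasoning only; the exact bookkeeping inside sparseLTS() may differ slightly):

n_lambda <- 10              # grid of penalty parameters
K <- 5; R <- 1              # foldControl(K = 5, R = 1)
fits_BIC <- n_lambda                # one fit per lambda on the full data
fits_PE  <- n_lambda * K * R + 1    # refit on every training set, plus one
                                    # final fit at the optimal lambda
c(BIC = fits_BIC, PE = fits_PE, ratio = fits_PE / fits_BIC)
# roughly a factor of 5, in line with the timings reported above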

So I don't think that crit = "PE" is particularly memory-hungry, but I can try to do some memory profiling.
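
A minimal memory-profiling sketch for the example above (assuming the bench package is installed; note that R's allocation profiling only sees R-level allocations, not working memory used inside the underlying C++ code):

library(bench)
mb_mem <- bench::mark(
  BIC = sparseLTS(x, y, lambda = frac, mode = "fraction"),
  PE  = sparseLTS(x, y, lambda = frac, mode = "fraction", crit = "PE"),
  iterations = 1, check = FALSE   # the fitted objects differ, so skip the check
)
mb_mem[, c("expression", "median", "mem_alloc")]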

I find it more curious that this example takes much less time on your machine than the NCI-60 example (65.8 seconds vs 319.12 seconds), even though the data here contain more observations and the same number of variables. I'll run this example on my Windows machine to see how the timings compare.

aalfons commented 2 years ago

This is what I got:

> library(microbenchmark)
> library(robustHD)
Loading required package: ggplot2
Want to understand how all the pieces fit together? Read R for Data Science:
https://r4ds.had.co.nz/
Loading required package: perry
Loading required package: parallel
Loading required package: robustbase
> 
> ## generate data
> ## the example is not high-dimensional to keep computation time low
> library("mvtnorm")
> set.seed(1234)  # for reproducibility
> n <- 100  # number of observations
> p <- 100   # number of variables
> beta <- rep.int(c(1, 0), c(5, p-5))  # coefficients
> sigma <- 0.5      # controls signal-to-noise ratio
> epsilon <- 0.1    # contamination level
> Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p))
> x <- rmvnorm(n, sigma=Sigma)    # predictor matrix
> e <- rnorm(n)                   # error terms
> i <- 1:ceiling(epsilon*n)       # observations to be contaminated
> e[i] <- e[i] + 5                # vertical outliers
> y <- c(x %*% beta + sigma * e)  # response
> x[i,] <- x[i,] + 5              # bad leverage points
> 
> 
> frac <- seq(0.1, 0.5, length.out=10)
> mb <- microbenchmark(single=sparseLTS(x, y, lambda = 0.05, mode =
+                                           "fraction"),
+                      BIC=sparseLTS(x, y, lambda = frac, mode = "fraction"),
+                      PE=sparseLTS(x, y, lambda = frac, mode = "fraction",
+                                   crit = "PE"), times=5)
> print(mb, signif=3)
Unit: seconds
   expr   min    lq  mean median    uq   max neval
 single  1.18  1.18  1.30   1.18  1.26  1.67     5
    BIC  7.24  7.25  8.08   7.39  8.50 10.00     5
     PE 31.00 31.20 32.30  32.70 33.10 33.70     5

The computation time on my machine is much closer to that on yours here than it was for the NCI-60 example.

valentint commented 2 years ago

Yes, but you are still about two times faster, which should not be due to the RAM for such a small data set. I will repeat the same test on a machine similar to mine, which I have next door, and will let you know.

Best V

aalfons commented 11 months ago

As I cannot reproduce the issue and there has been no recent activity in this discussion, I'll close it now.