abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/

cross validation in R package #248

Closed Mamba413 closed 2 years ago

Mamba413 commented 2 years ago

Describe the bug The cross-validation result does not match the result computed manually in R.

The code to reproduce

library(abess)
n <- 100
p <- 200
support.size <- 3
dataset <- generate.data(n, p, support.size, seed = 1)
ss <- 0:10

nfolds <- 5
foldid <- rep(1:nfolds, ceiling(n / nfolds))[1:n]
abess_fit <- abess(dataset[["x"]], dataset[["y"]], 
                   tune.type = "cv", nfolds = nfolds, 
                   foldid = foldid, support.size = ss, num.threads = 1)

cv <- rep(0, length(ss))
for (k in 1:nfolds) {
  abess_fit_k <- abess(dataset[["x"]][foldid != k, ], 
                       dataset[["y"]][foldid != k], support.size = ss)
  y_hat_k <- predict(abess_fit_k, dataset[["x"]][foldid == k, ], 
                     support.size = ss)
  fold_cv <- apply(y_hat_k, 2, function(yh) {
    mean((dataset[["y"]][foldid == k] - yh)^2)
  })
  fold_cv <- round(fold_cv, digits = 2)
  print(fold_cv)
  cv <- cv + fold_cv
}
cv <- cv / nfolds
names(cv) <- NULL
all.equal(cv, abess_fit$tune.value, digits = 2)

Expected behavior The output of all.equal(cv, abess_fit$tune.value, digits = 2) is TRUE. However, the output is "Mean relative difference: 0.0008444762".

System info

platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          1.0                         
year           2021                        
month          05                          
day            18                          
svn rev        80317                       
language       R                           
version.string R version 4.1.0 (2021-05-18)
nickname       Camp Pontanezen             
Mamba413 commented 2 years ago

In most cases, the result is correct. For example, when p is changed to 20, 30, or 40, the results match the expectation. I also printed the results in C++ and R. It seems that the difference is very subtle. The results printed in C++:

15520.7 3857.04 1832.87 1653.66    1220 1017.25 983.913  1148.9 1164.97 1739.28 1833.77
17938.3 5877.06 2929.43 2560.34 2416.98 2153.38 2565.76 2296.01 1999.35 2517.26 2355.97
13312.6 4048.81 2121.22 1885.49 1777.77 2327.84 2094.01 2507.27 2191.26 2632.55 2946.96
18170.5 4432.05 1575.29 1138.43 1291.11 1334.51 1109.72 1522.59 1977.38 1982.05 2043.54
13684.7 2128.02 1450.65 2080.38 2996.25 2554.03 3316.02 3065.15 2749.93 2768.66  2832.1

The results printed in R:

15520.68  3857.04  1832.87  1653.66  1220.00  1017.25   983.91  1148.90  1164.97  1739.28  1833.77
17938.29  5877.06  2929.43  2560.34  2416.98  1991.74  2565.76  2296.01  1999.35  2517.26  2355.97
13312.65  4048.81  2121.22  1885.49  1777.77  2327.84  2094.01  2507.27  2191.26  2632.55  2946.96
18170.55  4432.05  1575.29  1138.43  1291.11  1334.51  1109.72  1522.59  1977.38  1982.05  2043.54
13684.72  2128.02  1450.65  2080.38  2996.25  2554.03  3316.02  3065.15  2749.93  2768.66  2832.10

At present, I have no idea about why this happens. @Jiang-Kangkang @oooo26 @bbayukari

oooo26 commented 2 years ago

I found it is caused by rounding in cout. I got the same C++ results with a plain std::cout << {TEST LOSS};, but with std::cout << std::fixed << std::setprecision(2) << {TEST LOSS}; (which requires #include <iomanip>), I got:

15520.68  3857.04  1832.87  1653.66  1220.00  1017.25   983.91  1148.90  1164.97  1739.28  1833.77
17938.29  5877.06  2929.43  2560.34  2416.98  2153.38  2565.76  2296.01  1999.35  2517.26  2355.97
13312.65  4048.81  2121.22  1885.49  1777.77  2327.84  2094.01  2507.27  2191.26  2632.55  2946.96
18170.55  4432.05  1575.29  1138.43  1291.11  1334.51  1109.72  1522.59  1977.38  1982.05  2043.54
13684.72  2128.02  1450.65  2080.38  2996.25  2554.03  3316.02  3065.15  2749.93  2768.66  2832.10

That is the same as the R output.

Mamba413 commented 2 years ago

I found a very subtle discrepancy. In the second row and the sixth column, the output of cout is 2153.38 while R's output is 1991.74. The remaining values seem correct.

How does this happen? The only possible reason (I believe) is that the coefficients used for computing the test error are not the same as the beta passed to R.

oooo26 commented 2 years ago

I found out that the two workflows (abess and manual) are actually not equivalent because of normalization, which is enabled by default.

Since the "full data" and the "data in fold k" have different means and variances, the normalized data are also different. It is then no surprise that their betas are unequal, which causes unequal test losses.

In most instances this does not change the active set selected by abess, so the betas (and test losses) are close. Unfortunately, it does at sparsity 5 with the data in fold 2 above. That is why the "error" happens.

Btw, even in the other situations the test losses are not exactly equal: you would find very small differences at more (e.g. 10) decimal places.

oooo26 commented 2 years ago

With this code, the output is the same as the C++ one:

library(abess)
n <- 100
p <- 200
support.size <- 3
dataset <- generate.data(n, p, support.size, seed = 1)
ss <- 0:5

nfolds <- 5
foldid <- rep(1:nfolds, ceiling(n / nfolds))[1:n]
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
  tune.type = "cv", nfolds = nfolds,
  foldid = foldid, support.size = ss, num.threads = 1
)

x <- dataset[["x"]]
x <- apply(x, 2, function(c){c1 <- c - mean(c); c1 / norm(c1, type="2") * sqrt(n)})
y <- dataset[["y"]]
y <- y - mean(y)

cv <- rep(0, length(ss))
for (k in 1:nfolds) {
  abess_fit_k <- abess(x[foldid != k, ],
    y[foldid != k],
    support.size = ss,
    normalize = 0
  )
  y_hat_k <- predict(abess_fit_k, x[foldid == k, ],
    support.size = ss
  )
  fold_cv <- apply(y_hat_k, 2, function(yh) {
    mean((y[foldid == k] - yh)^2)
  })
  fold_cv <- round(fold_cv, digits = 2)
  print(fold_cv)
  cv <- cv + fold_cv
}
cv <- cv / nfolds
names(cv) <- NULL
all.equal(cv, abess_fit$tune.value, digits = 2)

This pre-processing simulates what the normalization step does inside abess.