abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/

Colname errors #526

Closed (qianchd closed this issue 1 year ago)

qianchd commented 1 year ago

Describe the bug

In some extreme cases, the column names of X can be duplicated, for example colnames = ("", "", "", "") or ("X", "X", "X", "X"). This makes the function predict.abess fail when it runs newx <- newx[, vn]. However, the main fitting function abess::abess still passes.

Code for Reproduction

n=100
p=50
b=c(1,1,rep(0, p - 2))
X = matrix(rnorm(n * p), n, p)
y <- X %*% b + rnorm(n)
colnames(X) <- rep("", p)
md <- abess::abess(X, y) # it passes the test 
predict(md, newx=X[1:10, ]) # Error in newx[, vn] : subscript out of bounds
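
The same behavior can be seen without abess, using only base R matrix indexing. The sketch below assumes that predict.abess selects columns by the stored variable names, as the newx[, vn] line suggests. The matrix m and its names are made up for illustration.

```r
# Base-R illustration only; `m` is a made-up matrix, not part of abess.
m <- matrix(1:6, nrow = 2)

# Case 1: empty names. When indexing by character, "" never matches any
# name (not even another ""), so the lookup raises an error.
colnames(m) <- rep("", 3)
empty_result <- tryCatch(m[, ""], error = function(e) conditionMessage(e))

# Case 2: duplicated non-empty names. Indexing succeeds, but every "X"
# resolves to the first matching column, so columns 2 and 3 are never used.
colnames(m) <- rep("X", 3)
dup_result <- m[, c("X", "X", "X")]
```

In the second case no error is raised, which is arguably worse: the result silently contains three copies of the first column.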

Expected behavior

The colnames of X need to be checked in the abess::abess function. If the colnames are duplicated, either an error should be raised or the matrix X should be treated as an unnamed matrix (as if colnames(X) were NULL). An alternative is to check whether rownames(object[["beta"]]) is a subset of the colnames of newx, which would strengthen the existing check if (!is.null(colnames(newx))) in the predict.abess function.
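
A minimal sketch of the subset check, written as a standalone helper. The name check_newx_names and its arguments are hypothetical and not part of abess. It assumes the model's variable names are available as a character vector, for example from rownames(object[["beta"]]).

```r
# Hypothetical helper, not part of abess: checks that every variable name
# stored in the fitted model can be looked up among the columns of newx.
check_newx_names <- function(model_vars, newx) {
  newx_names <- colnames(newx)
  if (is.null(newx_names)) {
    # Unnamed newx: nothing to check by name.
    return(invisible(TRUE))
  }
  missing_vars <- setdiff(model_vars, newx_names)
  # setdiff() treats "" as an ordinary value, but "" can never be used to
  # index a column by name, so flag empty names explicitly.
  if (any(!nzchar(model_vars))) {
    missing_vars <- union(missing_vars, "")
  }
  if (length(missing_vars) > 0) {
    stop("newx has no usable column for: ",
         paste0("'", missing_vars, "'", collapse = ", "))
  }
  invisible(TRUE)
}
```

Called before newx[, vn], such a check would replace the "subscript out of bounds" error with a message naming the problematic variables.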

bbayukari commented 1 year ago

Thanks for your valuable feedback, and I'm pleased to incorporate one of your suggestions into the program.

abess identifies variables by column name rather than by position, so duplicated names cause confusion. We plan to add the following check to avoid this situation:

  if (length(unique(para$vn)) != length(para$vn)) {
    stop("The colnames of x are duplicated!")
  }
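
To illustrate how this check behaves on the inputs from the report, here is a small sketch that wraps it in a function. The wrapper name check_duplicated_names is hypothetical, and vn stands in for para$vn, which is assumed here to hold the column names of x.

```r
# Hypothetical wrapper around the proposed check; `vn` stands in for para$vn.
check_duplicated_names <- function(vn) {
  if (length(unique(vn)) != length(vn)) {
    stop("The colnames of x are duplicated!")
  }
  invisible(TRUE)
}

# Helper that reports whether a set of names is accepted or rejected.
try_names <- function(vn) {
  tryCatch({
    check_duplicated_names(vn)
    "accepted"
  }, error = function(e) "rejected")
}

res_empty <- try_names(rep("", 4))    # all-empty names from the report
res_same  <- try_names(rep("X", 4))   # repeated "X" from the report
res_ok    <- try_names(c("x1", "x2")) # distinct names
res_null  <- try_names(NULL)          # unnamed matrix
```

Both inputs from the report are rejected, while distinct names and an unnamed matrix are accepted. A single empty name is not a duplicate, so it would pass this check on its own.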

Once again, thank you for your assistance.