abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/

One-from-group or N-from-group #492

Closed · MattWenham closed this issue 1 year ago

MattWenham commented 1 year ago

Best Subset of Group Selection allows for all members of non-overlapping groups to be selected, but would the converse be possible, i.e. choosing only one variable from each group? This scheme could be extended to N-from-group, or allow the user to set a maximum proportion of each group (e.g. 50%) which should be selected.

If any of this is possible with the current version of the software, I would be grateful to know how it can be implemented.

Yanhang129 commented 1 year ago

Thanks for your question! Do you mean that there is no sparsity across groups, only sparsity within each group?

MattWenham commented 1 year ago

> Thanks for your question! Do you mean that there is no sparsity across groups, only sparsity within each group?

As I understand your question, my initial question as posed assumes no sparsity across groups - i.e. all groups should be considered - but an extension could also be to allow sparsity across groups, e.g. choose 1 or N from each of M groups, perhaps.
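
For concreteness, a minimal sketch of the kind of scheme I have in mind, assuming two groups of three variables and exactly one variable chosen per group:

from itertools import combinations, product

# Toy setup: two groups of three variables; choose exactly one per group.
groups = [[0, 1, 2], [3, 4, 5]]
n_per_group = 1

# Each feasible support picks n_per_group variables from every group.
feasible_supports = [
    sum(pick, ())  # flatten the per-group choices into one support
    for pick in product(*(combinations(g, n_per_group) for g in groups))
]
print(feasible_supports)
# [(0, 3), (0, 4), (0, 5), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5)]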

Yanhang129 commented 1 year ago

> Thanks for your question! Do you mean that there is no sparsity across groups, only sparsity within each group?
>
> As I understand your question, my initial question as posed assumes no sparsity across groups - i.e. all groups should be considered - but an extension could also be to allow sparsity across groups, e.g. choose 1 or N from each of M groups, perhaps.

Sorry for the delay. It took us some time to discuss a possible solution. At present, this cannot be handled with a one-line command, but we can offer an alternative solution here.

Our idea is to select s variables in each group in a block-coordinate-wise manner (suppose there are J groups). Specifically, we fix the results of the other J-1 groups and use the splicing algorithm to obtain the optimal s variables for the first group. This process is repeated for each of the remaining J-1 groups in turn. The iteration stops when the active set remains unchanged after updating all J groups.

Here is an R implementation of this idea:

library(abess)
library(Matrix)
abess.group <- function(x, y, group, s){
  n <- nrow(x)
  p <- ncol(x)
  group.num <- length(table(group))
  # Marginal screening for each group to find the top s variables in each group
  x.cor <- apply(x, 2, function(xj) abs(cor(xj, y)))
  gi <- unique(group)
  index <- match(gi, group)
  A1 <- tapply(x.cor, group, function(x) order(x, decreasing = TRUE)[1:s])
  A1 <- unlist(mapply(function(x, y) x + y - 1, index, A1, SIMPLIFY = FALSE))
  A1 <- sort(A1)
  # The iteration for block-wise best-subset selection:
  A <- numeric(length(A1))
  while(any(A1 != A)) {
    A <- A1
    for (i in 1:group.num) {
      x_temp <- cbind(
        x[, group == i],                 # the data of the i-th group
        x[, A1[-(((i-1)*s+1):(i*s))]]    # the data of the selected variables in remaining groups
      )
      res <- abess(
        x_temp, y, 
        support.size = ((group.num-1)*s+1):(group.num*s),     # to select s variables in the i-th group (leveraging warm-start)
        always.include=tail(1:ncol(x_temp), (group.num-1)*s)  # to include the selected variables in remaining groups
      )
      ind <- which(extract(res, support.size = group.num*s)$beta != 0)[1:s]
      A1[(((i-1)*s+1):(i*s))] <- ind + index[i]-1
    }
  }
  return(A1)
}

We examine this idea with the code below and find that it works quite well.

set.seed(1)
n <- 100
p <- 50
group <- rep(1:5, each = 10)
beta <- rep(c(1, -1, rep(0, 8)), times = 5)
x <- matrix(rnorm(n*p, 0, 1), n, p)
y <- x %*% beta + rnorm(n, 0, 1)
s <- 2 # select s (= 2) variables from each group
abess.group(x, y, group, s)
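
For reference, the truly nonzero coefficients in this simulation are at positions 1, 2, 11, 12, 21, 22, 31, 32, 41, 42 (the first two variables of each group), so a successful run should recover exactly these indices.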

MattWenham commented 1 year ago

Many thanks, I will take a look at this in more detail soon. If you have a Python version of this implementation, that would be very useful; otherwise I will work on a conversion if we feel it is worthwhile.

Mamba413 commented 1 year ago

@MattWenham, to ease understanding, I have just modified the code above; I hope that helps. We are also willing to implement it in Python, but it will take a few days.

Mamba413 commented 1 year ago

@MattWenham @Weiniily hi, I have programmed a Python version of this idea. Here is the code:

from abess.linear import LinearRegression
from abess.datasets import make_glm_data
from sklearn.feature_selection import r_regression
from sklearn.feature_selection import SelectFromModel
import numpy as np

def abess_group(x, y, group, s):
    n, p = x.shape
    group_label = np.unique(group)
    group_num = len(group_label)

    # Marginal screening for each group to find the top s variables in each group:
    x_cor = np.abs(r_regression(x, y))
    A = []
    group_start_index = []
    for i in group_label:
        group_i_index = group == i
        group_start_index.append(np.where(group_i_index)[0].min())
        each_group_num = np.sum(group_i_index)
        A.extend(np.where(
            np.logical_and(group_i_index, x_cor >= np.sort(x_cor[group_i_index])[each_group_num - s])
        )[0].tolist())
    A = np.array(A)
    group_start_index = np.array(group_start_index)

    # The iteration for block-wise best-subset selection:
    model = LinearRegression()
    A1 = np.array([0] * s * group_num)
    select_var_group = group_label.repeat(s)
    while np.any(A != A1):
        A1 = A.copy()
        for i in group_label:
            x_toselect = x[:, group == i]               # the data of the i-th group
            x_fixed = x[:, A[select_var_group != i]]    # the data of the selected variables in remaining groups
            candidate_num = x_toselect.shape[1]
            x_fit = np.hstack([x_toselect, x_fixed])
            fixed_num = x_fit.shape[1] - candidate_num
            model.set_params(
                # to include the selected variables in the remaining groups:
                always_select=np.arange(candidate_num, x_fit.shape[1]),
                # fix the support size so that exactly s candidate variables are
                # selected on top of the fixed_num always-included ones
                # (mirroring extract(res, support.size = group.num*s) in the R code):
                support_size=[fixed_num + s],
            )
            model.fit(x_fit, y)
            # update the selected s variables (convert candidate-block positions
            # back to global column indices; this assumes the group labels are
            # 0, 1, ..., group_num - 1, as in the example below):
            select_result_tmp = SelectFromModel(estimator=model, prefit=True).get_support()[:candidate_num]
            A[select_var_group == i] = np.where(select_result_tmp)[0] + group_start_index[i]
    return A

I tested this implementation with:

np.random.seed(1)
p = 50
group = np.arange(5).repeat(10)
coef_ = np.array([1.0 if i % 10 in (0, 1) else 0.0 for i in range(p)])
dataset = make_glm_data(n=100, p=p, coef_=coef_, family='gaussian', k=10)
print('True:', np.where(coef_ != 0.0)[0])
est = abess_group(dataset.x, dataset.y, group, s=2)
print("Estimate:", est)

And the result is quite good:

True: [ 0  1 10 11 20 21 30 31 40 41]
Estimate: [ 0  1 10 11 20 21 30 31 40 41]
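
Note that the one-from-group case in the issue title should be just the special case s = 1; a minimal sketch, assuming every group contains at least one variable:

# One-from-group: select a single variable from each group.
est_one = abess_group(dataset.x, dataset.y, group, s=1)
print("One per group:", est_one)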

Finally, I attach my Python and OS info. [screenshot omitted]

I hope that helps your research and data analysis!

MattWenham commented 1 year ago

> @MattWenham @Weiniily hi, I have programmed a Python version of this idea.

Many thanks, that's certainly code we can work with for our application 😊