Closed MattWenham closed 1 year ago
Thanks for your question! Do you mean that there is no sparsity across groups, only sparsity within each group?
Thanks for your question! Do you mean that there is no sparsity across groups, only sparsity within each group?
As I understand your question, my initial question as posed assumes no sparsity across groups - i.e. all groups should be considered - but an extension could also be to allow sparsity across groups, e.g. choose 1 or N from each of M groups, perhaps.
Thanks for your question! Do you mean that there is no sparsity across groups, only sparsity within each group?
As I understand your question, my initial question as posed assumes no sparsity across groups - i.e. all groups should be considered - but an extension could also be to allow sparsity across groups, e.g. choose 1 or N from each of M groups, perhaps.
Sorry for the delay. It takes sometimes for us to discussion a possible solution. At present, we cannot use one line command to handle this problem, but we give an alternative solution here.
Our idea aims to use a block-coordinate-wise manner to select s variables in each group (suppose there are J groups). Specifically, we fix the results of the J-1 groups and use the splicing algorithm to obtain the optimal s variables for the first group. This process is similarly repeated for the remaining J-1 groups. The iteration stops when the active set remains unchanged after updating all the J groups.
Here is the implementation for our idea:
library(abess)
library(Matrix)
abess.group <- function(x, y, group, s){
n <- nrow(x)
p <- ncol(x)
group.num <- length(table(group))
# Marginal screening for each group to find the top s variables in each group
x.cor <- apply(x, 2, function(x) abs(cor(x, y)))
gi <- unique(group)
index <- match(gi, group)
A1 <- tapply(x.cor, group, function(x) order(x, decreasing = TRUE)[1:s])
A1 <- unlist(mapply(function(x, y) x + y - 1, index, A1, SIMPLIFY = FALSE))
A1 <- sort(A1)
# The iteration for block-wise best-subset selection:
A <- numeric(length(A1))
while(any(A1 != A)) {
A = A1
for (i in 1:group.num) {
x_temp <- cbind(
x[, group == i], # the data of the i-th group
x[, A1[-(((i-1)*s+1):(i*s))]] # the data of the selected variables in remaining groups
)
res <- abess(
x_temp, y,
support.size = ((group.num-1)*s+1):(group.num*s), # to select s variables in the i-th group (leveraging warm-start)
always.include=tail(1:ncol(x_temp), (group.num-1)*s) # to include the selected variables in remaining groups
)
ind <- which(extract(res, support.size = group.num*s)$beta != 0)[1:s]
A1[(((i-1)*s+1):(i*s))] <- ind + index[i]-1
}
}
return(A1)
}
We examine this idea by code below, and find it works pretty well.
set.seed(1)
n <- 100
p <- 50
group <- rep(1:5, each = 10)
beta <- rep(c(1, -1, rep(0, 8)), times = 5)
x <- matrix(rnorm(n*p, 0, 1), n, p)
y <- x %*% beta + rnorm(n, 0, 1)
s <- 2 # Each group selects s(=2) variables
abess.group(x, y, group, s)
Many thanks, I will take a look at this in more detail soon. If you have a python version of this implementation, that would be very useful, otherwise I will work on a conversion if we feel it will be useful.
@MattWenham , to ease understanding, I have just modified the code above and I hope that helps. We are also willing to implement it with python, but it will take a few days.
@MattWenham @Weiniily hi, I have programmed a python version for this idea. Here is the code:
from abess.linear import LinearRegression
from abess.datasets import make_glm_data
from sklearn.feature_selection import r_regression
from sklearn.feature_selection import SelectFromModel
import numpy as np
def abess_group(x, y, group, s):
n, p = x.shape
group_label = np.unique(group)
group_num = len(group_label)
# Marginal screening for each group to find the top s variables in each group:
x_cor = np.abs(r_regression(x, y))
A = []
group_start_index = []
for i in group_label:
group_i_index = group == i
group_start_index.append(np.where(group_i_index)[0].min())
each_group_num = np.sum(group_i_index)
A.extend(np.where(
np.logical_and(group_i_index, x_cor >= np.sort(x_cor[group_i_index])[each_group_num - s])
)[0].tolist())
A = np.array(A)
group_start_index = np.array(group_start_index)
# The iteration for block-wise best-subset selection:
model = LinearRegression()
A1 = np.array([0] * s * group_num)
select_var_group = group_label.repeat(s)
while np.any(A != A1):
A1 = A.copy()
for i in group_label:
x_toselect = x[:, group == i] # the data of the i-th group
x_fixed = x[:, A[select_var_group != i]] # the data of the selected variables in remaining groups
candidate_num = x_toselect.shape[1]
x_fit = np.hstack([x_toselect, x_fixed])
model.set_params(
# to include the selected variables in remaining groups:
always_select=np.arange(candidate_num, x_fit.shape[1]),
# to select s variables in the i-th group (leveraging warm-start):
support_size=range(candidate_num, s+candidate_num),
)
model.fit(x_fit, y)
# update the selected s variables:
select_result_tmp = SelectFromModel(estimator=model, prefit=True).get_support()[range(candidate_num)]
A[select_var_group == i] = np.array(np.where(select_result_tmp)[0] + group_start_index[i])
pass
return A
I simply test this implementation via:
np.random.seed(1)
p = 50
J = 10
group = np.linspace(0, 4, 5).repeat(10).astype(np.int32)
coef_ = np.array([1.0 if i % 10 == 0 or i %10 == 1 else 0.0 for i in range(50)])
dataset = make_glm_data(n=100, p=50, coef_=coef_, family='gaussian', k = 10)
print('True:', np.where(coef_ != 0.0)[0])
est = abess_group(dataset.x, dataset.y, group, s=2)
print("Estimate:", est)
And the result is pretty well:
True: [ 0 1 10 11 20 21 30 31 40 41]
Estimate: [ 0 1 10 11 20 21 30 31 40 41]
Finally, I attach my python and OS info:
abess
(version: 0.4.6)Wish that helps your research and data analysis!
@MattWenham @Weiniily hi, I have programmed a python version for this idea.
Many thanks, that's certainly code we can work with for our application 😊
Best Subset of Group Selection allows for all members of non-overlapping groups to be selected, but would the converse be possible, i.e. choosing only one variable from each group? This scheme could be extended to N-from-group, or allow the user to set a maximum proportion of each group (e.g. 50%) which should be selected.
If any of this is possible in with current version of the software, I would be grateful to know how it can be implemented.