abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/

Incorrect optimal support size in a multitask learning problem with nonnegativity constraints #510

Open tomwenseleers opened 1 year ago

tomwenseleers commented 1 year ago

Describe the bug

I was testing abess on the multitask learning problem below (deconvolution of a multichannel signal with a known point spread function / blur kernel, by regressing the multichannel signal on shifted copies of the point spread function), but I am getting poor performance: the best subset is inferred to be much larger than it actually is (support size of the selected best subset = 157, while the true support size of the simulated dataset is 50). Is there a way to resolve this? Does it have to do with the fact that abess currently does not support nonnegativity constraints (all my simulated model coefficients are positive)? Is there any way to allow for nonnegativity or box constraints? Even if they were not taken into account during fitting, it would be handy if the coefficients could at least be clipped to the allowed range, so that the information criteria would indicate the correct support size of the best allowed model. Taking the constraints into account during fitting would of course be even better.

I was also interested in fitting a multitask identity-link Poisson model. As only the Poisson log link is currently supported, I could potentially get something close to it by using an observation-weight matrix 1/(Y+0.1), which approximates 1/(Poisson variance) weights, together with family="mgaussian". Is that allowed? Can the observation weights be a matrix with family="mgaussian", to allow different observation weights per outcome channel/task? If not, could this be allowed for?

Code for Reproduction

Paste your code for reproducing the bug:

library(remotes)
remotes::install_github("tomwenseleers/L0glm/L0glm")
library(L0glm)
# simulate blurred multichannel spike train
set.seed(1)
s <- 0.1 # sparsity (% of timepoints where there is a peak)
p <- 500 # 500 variables
# simulate multichannel blurred spike train with Gaussian noise
sd_noise <- 1
sim <- simulate_spike_train(n=p, 
                            p=p, 
                            k=round(s*p), # true support size = 0.1*500 = 50
                            mean_beta = 10000,
                            sd_logbeta = 1,
                            family="gaussian",
                            sd_noise = sd_noise,
                            multichannel=TRUE, sparse=TRUE)
X <- sim$X # covariate matrix with shifted copies of point spread function, n x p matrix
Y <- sim$y # multichannel signal (blurred spike train), n x m matrix  
colnames(X) = paste0("x", 1:ncol(X)) # NOTE: if colnames of X and Y are not set abess gives an error message, maybe fix this?
colnames(Y) = paste0("y", 1:ncol(Y))
true_coefs <- sim$beta_true # true coefficients
m <- ncol(Y) # nr of tasks
n <- nrow(X) # nr of observations
p <- ncol(X) # nr of independent variables (shifted copies of point spread functions)
W <- 1/(Y+0.1) # approx 1/variance Poisson observation weights with family="poisson", n x m matrix

# best subset multitask learning using family="mgaussian" using abess
abess_fit <- abess(x = X, 
                   y = Y, 
                   # weights = 1/(Y+0.1),  # QUESTION: if I use family="poisson" above I would like to be able to use this matrix as approx. 1/Poisson variance observation weights to be able to fit identity link poisson using family="mgaussian" in abess - is this possible/allowed?
                   family = "mgaussian",
                   tune.path = "sequence", 
                   support.size = c(1:200),
                   lambda = 0, 
                   warm.start = TRUE,
                   tune.type = "gic") # or cv or ebic or bic or aic
plot(abess_fit, type="tune")  # optimal support size comes out around 150 when in fact it is 50 - presumably because my coefficients are constrained to be all positive, and abess currently does not allow nonnegativity or box constraints?
beta_abess = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
library(qlcMatrix)
image(x=1:nrow(sim$y), y=1:ncol(sim$y), z=beta_abess^0.1, col = topo.colors(255),
      useRaster=TRUE,
      xlab="Time", ylab="Channel", main="abess multitask mgaussian (red=true peaks)")
abline(v=(1:nrow(sim$X))[as.vector(rowMax(sim$beta_true)!=0)], col="red")
abline(v=(1:nrow(sim$X))[as.vector(rowMin(beta_abess)!=0)], col="cyan")
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 157 peaks detected

Expected behavior

A couple of questions here (see above); the selected support size should be close to the true support size of 50.

Screenshots

[GIC vs. support size tuning curve; abess multitask mgaussian coefficient image]

Mamba413 commented 1 year ago

@tomwenseleers , thanks for your questions!

I have reproduced the results based on your pasted code, and I notice that the sample size may be insufficient for recovering a coefficient matrix this dense (50 non-zero rows). According to A Polynomial Algorithm for Best-Subset Selection Problem, the recommended maximum support size is $n / (\log(p) \cdot \log(\log(n)))$, i.e., around 44 in your setting. Therefore, recovering 50 non-zero rows is too difficult and exceeds the information-theoretic limit.
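For a quick sanity check of this bound in R (a minimal sketch, assuming natural logarithms):

# Recommended maximum support size n / (log(p) * log(log(n))) for the
# original simulation with n = p = 500 (natural logarithms assumed):
n <- 500
p <- 500
n / (log(p) * log(log(n)))  # ~44, i.e. below the 50 non-zero rows to recover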

A direct solution is to collect more samples. In the following example, three times the sample size is used and abess returns an exactly correct result.

library(abess)
library(L0glm)      # provides simulate_spike_train()
library(qlcMatrix)  # provides rowMax() / rowMin()
# simulate blurred multichannel spike train
s <- 0.1 # sparsity (% of timepoints where there is a peak)
p <- 500 # 500 variables
# simulate multichannel blurred spike train with Gaussian noise
sd_noise <- 1
sim <- simulate_spike_train(n=3*p, 
                            p=p, 
                            k=round(s*p), # true support size = 0.1*500 = 50
                            mean_beta = 10000,
                            sd_logbeta = 1,
                            family="gaussian",
                            sd_noise = sd_noise,
                            multichannel=TRUE, sparse=TRUE, 
                            seed = 123
                            )
X <- sim$X # covariate matrix with shifted copies of point spread function, n x p matrix
Y <- sim$y # multichannel signal (blurred spike train), n x m matrix  
colnames(X) = paste0("x", 1:ncol(X)) # NOTE: if colnames of X and Y are not set abess gives an error message, maybe fix this?
colnames(Y) = paste0("y", 1:ncol(Y))
true_coefs <- sim$beta_true # true coefficients
m <- ncol(Y) # nr of tasks
n <- nrow(X) # nr of observations
p <- ncol(X) # nr of independent variables (shifted copies of point spread functions)

# best subset multitask learning using family="mgaussian" using abess
abess_fit <- abess(x = X, 
                   y = Y, 
                   family = "mgaussian",
                   tune.path = "sequence", 
                   lambda=0, 
                   warm.start = TRUE,
                   tune.type = "gic") # or cv or ebic or bic or aic
plot(abess_fit, type="tune") 
beta_abess = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 50 peaks detected

real <- as.numeric(rowMax(sim$beta_true)!=0)
est <- as.numeric(rowMin(beta_abess)!=0)
mean(real == est)

So, abess can handle nonnegative regression coefficients. I think your question is still insightful, and I believe the inclusion of nonnegativity (or box) constraints could reduce the sample size needed to solve this problem correctly. We are now developing a related library that supports a nonnegativity constraint. Once that project is mature, we will be glad to let you know, if you are interested.

As for the question about multitask identity-link Poisson models, I conducted a simple experiment to check your idea:

library(abess)
library(L0glm)      # provides simulate_spike_train()
library(qlcMatrix)  # provides rowMax() / rowMin()
# simulate blurred multichannel spike train
s <- 0.1 # sparsity (% of timepoints where there is a peak)
p <- 500 # 500 variables
# simulate multichannel blurred spike train with Gaussian noise
sd_noise <- 1
sim <- simulate_spike_train(n=3*p, 
                            p=p, 
                            k=round(s*p), # true support size = 0.1*500 = 50
                            mean_beta = 10000,
                            sd_logbeta = 1,
                            family="poisson",
                            sd_noise = sd_noise,
                            multichannel=TRUE, sparse=TRUE, 
                            seed = 123
                            )
X <- sim$X # covariate matrix with shifted copies of point spread function, n x p matrix
Y <- sim$y # multichannel signal (blurred spike train), n x m matrix  
colnames(X) = paste0("x", 1:ncol(X)) 
colnames(Y) = paste0("y", 1:ncol(Y))
true_coefs <- sim$beta_true # true coefficients
m <- ncol(Y) # nr of tasks
n <- nrow(X) # nr of observations
p <- ncol(X) # nr of independent variables (shifted copies of point spread functions)
W <- 1/(Y+0.1) # approx 1/variance Poisson observation weights with family="poisson", n x m matrix

# best subset multitask learning using family="mgaussian" using abess
abess_fit <- abess(x = X, 
                   y = Y, 
                   weights = W,  
                   family = "mgaussian",
                   tune.path = "sequence", 
                   # support.size = c(1:50),
                   lambda=0, 
                   warm.start = TRUE,
                   tune.type = "gic") # or cv or ebic or bic or aic
plot(abess_fit, type="tune") 
beta_abess = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 51 peaks detected

real <- as.numeric(rowMax(sim$beta_true)!=0)
est <- as.numeric(rowMin(beta_abess)!=0)
mean(real == est)  # 0.998

The estimated result is very close to the true one.

Feel free to contact us if you still have any questions.

tomwenseleers commented 1 year ago

Many thanks for that - that's very helpful! And I'm glad abess correctly takes into account the weight matrix for identity-link Poisson. For the problem with Gaussian noise I also found that I can get a good solution by pre-selecting variables using nonnegative group LASSO with glmnet (which gives a solution that is still not quite sparse enough) and then selecting the best subset among the LASSO-selected variables using abess (though this is admittedly not so elegant, and it sacrifices part of the computational edge that abess has over glmnet). But it does confirm that the poor performance was due to the lack of nonnegativity constraints in abess.

As a first step we then do the nonnegative group LASSO:

library(glmnet)
system.time(cvfit <- cv.glmnet(X, 
                               Y, 
                               alpha = 1,  # lasso
                               family = "mgaussian", 
                               nlambda = 100,
                               nfolds = 5, 
                               standardize = FALSE,
                               standardize.response = FALSE,
                               intercept = FALSE, 
                               relax = FALSE,
                               lower.limits = rep(0, ncol(X)+1))) # for nonnegativity constraints
# 26.71s

plot(cvfit)

# Fit the glmnet model over a sequence of lambda values
system.time(fit_glmnet <- glmnet(X, 
                     Y,
                     alpha = 1, # lasso
                     family = "mgaussian", 
                     nlambda = 100,
                     standardize = FALSE,
                     standardize.response = FALSE,
                     intercept = FALSE, 
                     relax = FALSE, 
                     lower.limits = rep(0, ncol(X)+1))) # for nonnegativity constraints
# 8.55s

# Select the best lambda value (I am using cvfit$lambda.1se to get a slightly sparser solution)
best_lambda <- cvfit$lambda.1se

plot(fit_glmnet, xvar="lambda")

# get the coefficients for each task for the best lambda value
coefs <- coef(fit_glmnet, s = best_lambda)
beta_mgaussian_glmnet <- do.call(cbind, lapply(seq_len(m), function (channel) as.matrix(coefs[[channel]][-1,,drop=F])))
colnames(beta_mgaussian_glmnet) = 1:ncol(beta_mgaussian_glmnet)
beta_mgaussian_glmnet[abs(beta_mgaussian_glmnet)<0.01] = 0 # slight amount of thresholding of small coefficients

image(x=1:nrow(sim$y), y=1:ncol(sim$y), z=beta_mgaussian_glmnet^0.01, col = topo.colors(255),
      useRaster=TRUE,
      xlab="Time", ylab="Channel", main="nonnegative group Lasso glmnet (red=true support)")
abline(v=(1:nrow(sim$X))[as.vector(rowMax(sim$beta_true)!=0)], col="red")
# abline(v=(1:nrow(sim$X))[as.vector(rowMin(beta_mgaussian_glmnet)!=0)], col="cyan")
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_mgaussian_glmnet)!=0) # 71 peaks detected

Then we do best subset selection with abess among the variables that are now known to yield nonnegative coefficients:

library(abess)
preselvars = which(rowMin(beta_mgaussian_glmnet)!=0) # pre-select variables based on nonnegative mgaussian group LASSO
system.time(abess_fit <- abess(x = X[,preselvars,drop=F], 
                               y = Y, 
                               family = "mgaussian",
                               tune.path = "sequence",
                               support.size = c(1:length(preselvars)),
                               lambda=0, 
                               max.splicing.iter = 20,
                               warm.start = TRUE,
                               tune.type = "ebic") # or cv or bic or aic or gic
            ) # 0.29s
extract(abess_fit)$support.size # 55
plot(abess_fit, type="tune")

beta_abess = beta_mgaussian_glmnet
beta_abess[preselvars,] = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
beta_abess[beta_abess < 0] = 0
image(x=1:nrow(sim$y), y=1:ncol(sim$y), z=beta_abess^0.1, col = topo.colors(255),
      useRaster=TRUE,
      xlab="Time", ylab="Channel", main="abess multitask mgaussian (red=true peaks)")
abline(v=(1:nrow(sim$X))[as.vector(rowMax(sim$beta_true)!=0)], col="red")
abline(v=(1:nrow(sim$X))[as.vector(rowMin(beta_abess)!=0)], col="cyan")
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 55 peaks detected
length(intersect(which(rowMax(sim$beta_true)!=0), 
                 which(rowMin(beta_abess)!=0))) # 45 out of 50 peaks detected

With regard to the calculation of the information criteria, one thing I was not 100% sure about myself is how the effective degrees of freedom should be defined in the multitask learning setting. Am I correct that abess just uses the number of selected variables, i.e. sum(rowMin(abs(beta))!=0), as the effective degrees of freedom when calculating the information criteria? Or is it sum(beta!=0), or perhaps the number of selected variables times the number of tasks, i.e. sum(rowMin(abs(beta))!=0)*ncol(beta)? (Using sum(beta!=0) seemed to perform best for me at first sight.)

And, as I mentioned, for a problem with nonnegativity constraints, clipping all negative coefficients to zero before calculating the information criteria also helps to identify the correct support size: including too many variables then results in more coefficients being estimated as negative, which after clipping to zero worsens the log-likelihood, whereas without clipping the log-likelihood keeps improving, but only by letting coefficients take non-permissible values. So I think an option allowing for that would still be helpful, even if explicit support for box or nonnegativity constraints during fitting cannot be added to this package.
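To make the three candidate definitions concrete, here is a small R illustration (this only restates my question, it is not a claim about abess internals; beta stands for a fitted coefficient matrix such as beta_abess above, logLik_fit for a fitted log-likelihood, and gic() is just one common generic-information-criterion form):

# Three candidate "degrees of freedom" for a p x m multitask coefficient matrix beta:
df_rows   <- sum(apply(abs(beta), 1, min) != 0)  # nr of selected variables (rows)
df_cells  <- sum(beta != 0)                      # nr of nonzero coefficients
df_rows_m <- df_rows * ncol(beta)                # selected rows x nr of tasks

# One common GIC form (hypothetical helper, for illustration only):
gic <- function(logLik_fit, df, n, p) -2 * logLik_fit + df * log(p) * log(log(n))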

Mamba413 commented 1 year ago

I will first address your questions.

Finally, we are grateful to you for actively using abess to carry out numerical experiments that we have not tried before, which has inspired active discussions within our team. Our team is interested in your code for these experiments and believes it could help us add the option of non-negativity constraints to the abess package. Would you be willing to share your code on GitHub?

tomwenseleers commented 1 year ago

Well, I was asking these questions partly to compare the solution quality of abess with that of my own L0glm package that I have been developing. That package uses quite a different principle to try to identify a stable set of variables closely corresponding to the best subset, namely iterative adaptive ridge penalties, which has the advantage that it results in a very stable solution that also works well in high-dimensional cases (see Frommlet & Nuel 2016 and Li et al. 2021, among others). There, all I had to do to make this work for any GLM (family and link function supported by default in R) was to replace the weighted least squares step in the IRLS GLM algorithm by a weighted least squares ridge regression with adaptive penalty weights that depend on the coefficients from the previous iteration. To implement box or nonnegativity constraints in those regression steps, one can use the osqp quadratic programming solver (cf. code here), glmnet's bigGlm, which has arguments lower and upper, or simply clip the coefficients to the allowed range in each iteration. The latter also works surprisingly well in my algorithm, with the advantage that I can then use all the fast Eigen solvers for either sparse or dense systems, such as Cholesky, conjugate gradient or least squares conjugate gradient. The L0-penalized objective that one optimizes there also maps directly onto information criteria like GIC, BIC, EBIC, mBIC or AIC, which means that one can choose lambda a priori to optimize a given information criterion, thereby avoiding the need to run the model over a lambda sequence or do cross-validation.

I then modified the basic single-channel algorithm to the multitask learning case (for any of R's distribution families and link functions) by trying to approximate an L0,Linfinity block-norm penalty in my model. To get something like this, what I did was (1) first fit each channel/task independently, which corresponds to an L0,L0 block-norm penalty; (2) then keep refitting each task sequentially, but using the average coefficients over all tasks as starting values in my L0glm fit algorithm, and iterate this until convergence, so that variables whose coefficients are zero across all tasks are penalized down to zero; and (3) once the true support is identified, refit all tasks for the selected variables without any penalization, just keeping the box constraints on the coefficients. I don't have any theoretical proofs that this is correct, but empirically it gives really good results, closely recovering the true support (coming up with mathematical proofs is also beyond my ability unfortunately, so my algorithms are guided more by intuition).

I just put some code on GitHub showing the performance of my method: https://github.com/tomwenseleers/L0glm/blob/master/benchmarks/benchmarks%20multichannel%20deconvolution%20gaussian%20L0glm%20abess%20glmnet.R [caveat: my student is still working hard on this package, so none of the code is finished yet; most of it is still in flux and as yet undocumented]. I would be keen to see how abess compares in these benchmarks once nonnegativity constraints are allowed.
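For concreteness, here is a minimal single-response sketch of the adaptive-ridge-with-clipping idea (Gaussian/identity case only, with a fixed lambda; an illustration of the principle described above, not the actual L0glm implementation):

# Iterative adaptive ridge with clipping (illustrative sketch, not L0glm itself).
# Ridge weights w_j = 1/(beta_j^2 + eps) approximate an L0 penalty
# (cf. Frommlet & Nuel 2016); pmax() clips negative coefficients to zero.
adaptive_ridge_nonneg <- function(X, y, lambda, n_iter = 50, eps = 1e-8) {
  p <- ncol(X)
  beta <- rep(1, p)                       # positive starting values
  XtX <- crossprod(X)
  Xty <- drop(crossprod(X, y))
  for (i in seq_len(n_iter)) {
    w <- 1 / (beta^2 + eps)               # adaptive penalty weights
    beta <- drop(solve(XtX + lambda * diag(w), Xty))
    beta <- pmax(beta, 0)                 # clip to enforce nonnegativity
  }
  beta
}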

Right now the method can recover the true support of a noisy multichannel signal with near-perfect accuracy, simultaneously giving nearly perfect prediction accuracy, unbiased coefficients, and very efficient recovery of the true support (I always thought there was a bias-variance tradeoff, but it seems I can somehow resolve that tradeoff here - kind of magic). [Figures: simulated multichannel spike train with true coefficients; true coefficients / true support; estimated coefficients and support, nonnegative relaxed group L0; true vs. estimated coefficients, relaxed group L0.]

By way of comparison, a glmnet group LASSO fit selects far too many variables and also has much worse prediction performance. [Figure: true vs. estimated coefficients.]

Mamba413 commented 1 year ago

Thank you for providing a detailed description of the algorithm and for generously sharing the details of L0glm. Intuitively, clipping appears to be a viable approach, particularly when the true regression coefficients of the variables fall within the specified range and the variables are independent. While this point requires a rigorous proof, preliminary implementations from a software perspective could benefit users. We will consider implementing this feature as soon as possible and will notify you once it is complete.

tomwenseleers commented 1 year ago

In this particular example my variables are not independent, and yet clipping still seemed to perform very well for recovering the true solution. In fact osqp, where hard nonnegativity constraints are formally incorporated, did not perform better than just using the Eigen solvers and clipping negative values to zero. That is probably because the true coefficients are indeed all positive, the optimal level of ridge regularisation I start off with is quite high, and in subsequent iterations the adaptive penalties do their thing. But it did surprise me, yes. I had thought that if nonnegativity constraints were incorporated via clipping, the optimal lambda would perhaps come out a little higher than with hard constraints as in the osqp solver, but that was not the case... So it was not a requirement for the variables to be orthogonal for clipping to perform well as a way to achieve nonnegativity constraints.
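For reference, this is roughly what the hard-constraint variant of one such regression step looks like with the osqp R package (a sketch using the standard QP form min 0.5*b'Pb + q'b subject to l <= A b <= u; w stands for the adaptive ridge weights of the current iteration):

library(osqp)

# Nonnegative weighted-ridge least squares solved as a QP with hard constraints
# (sketch; this replaces the clipped solve in a single iteration).
nonneg_ridge_osqp <- function(X, y, lambda, w = rep(1, ncol(X))) {
  p <- ncol(X)
  P <- crossprod(X) + lambda * diag(w, nrow = p)  # quadratic term
  q <- -drop(crossprod(X, y))                     # linear term
  A <- diag(p)                                    # identity: bound each coefficient
  res <- solve_osqp(P, q, A, l = rep(0, p), u = rep(Inf, p),
                    pars = osqpSettings(verbose = FALSE))
  res$x
}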

Mamba413 commented 1 year ago

Yes. I agree with your point of view that orthogonality is not necessary. I just used it as a direct entry point to understand why clipping is effective. Moreover, thank you for sharing the experimental results with us and explaining the power of clipping. We are willing to implement this in abess.

Mamba413 commented 1 year ago

@tomwenseleers Hello, I want to inform you that we have implemented the clipping operation in C++, which has made the entire iteration process very fast. Currently, this feature is available in the Python software package (see https://github.com/abess-team/abess/pull/516/files), but requires compiling and installing the latest code from GitHub. We will soon extend this feature to R and release it on CRAN, making it more convenient to use. It may not take long, and once it's done, we will notify you immediately. Finally, thank you for providing valuable suggestions for the abess project.

tomwenseleers commented 1 year ago

Ha, that's very cool - many thanks, and looking forward to it! Did it also result in good performance for this problem, not only in terms of speed but also in terms of support recovery and coefficient estimates? To what problem sizes does it scale? Would the example above also still run well, e.g., for n=p=10000 (in my application that's my problem size)?

Mamba413 commented 1 year ago

Yes, it also supports the accurate recovery of relevant variables. Additionally, we tested it with n=p=10000. For simulated data, we were able to compute the results on a personal computer in approximately 1 to 2 minutes. Below is the specific code:

# %%
import numpy as np
import abess

np.random.seed(123)

# %%
%%time
n = 10000
p = 10000
k = 100

coef_real = np.zeros(p)
coef_real[np.random.choice(p, k, replace=False)] = 1
data = abess.make_glm_data(n, p, k, "gaussian", coef_=coef_real)
print(f"n={n}, p={p}, k={k}, ")

# %%
%%time
model = abess.LinearRegression()
model.fit(data.x, data.y, beta_low=0)

# %%
nonzero1 = np.nonzero(model.coef_)[0]
nonzero2 = np.nonzero(data.coef_)[0]

print(f"Successful? {(nonzero1==nonzero2).all()}")

I hope our follow-up developments will be helpful for your real-world data applications.

tomwenseleers commented 1 year ago

Many thanks - that sounds really promising - thanks for your work on this! My iterative adaptive ridge method took half a minute for that problem size (I'm currently finishing up the documentation), so it will be interesting to compare the performance in terms of solution quality. I think for small problems abess was faster than my method, whereas for big problems mine was slightly faster.

One slight remark perhaps: I see beta_low and beta_high are passed as a single value and not as a vector specifying the bounds on each coefficient. Just bear in mind that setting beta_low to zero will then work for gaussian or poisson, but not for binomial or multinomial, as the intercept term there should always be allowed to be unbounded and only the remaining coefficients can be constrained to be nonnegative. But I imagine you've thought of that and taken care of it by not applying the box constraints to the intercept term(s)? Another option of course would be to allow the lower and upper bounds to be vectors instead of single values, as in e.g. glmnet. That would also give the flexibility to support different box constraints on different parameters if desired.
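For comparison, this is what per-coefficient bounds look like in glmnet (an illustrative sketch; X and Y stand for the covariate matrix and response from the earlier spike-train examples, using the first response channel only):

library(glmnet)

# One lower bound per coefficient; glmnet never bounds the intercept, and
# individual coefficients can be given different box constraints if desired.
lower <- rep(0, ncol(X))        # nonnegativity for all coefficients...
lower[1:10] <- -Inf             # ...except, say, the first 10 left unconstrained
fit_bounded <- glmnet(X, Y[, 1], family = "gaussian",
                      lower.limits = lower,
                      upper.limits = rep(Inf, ncol(X)))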

bbayukari commented 1 year ago

I'm delighted to inform you that version 0.4.8, which includes the functionality of restricting estimation ranges through truncation, has been successfully uploaded to CRAN. You can now conveniently utilize abess for best subset selection with non-negative constraints in R. With the parameters beta.low and beta.high, you can easily specify the lower and upper bounds for coefficient estimation. As you pointed out, these two parameters do not affect the estimation of the intercept term. Here is an example:

library(abess)

n = 10000
p = 10000
k = 100

coef_real = rep(0, p)
coef_real[sample(1:p, k, replace = FALSE)] <- 1
data <- generate.data(n, p, k, "gaussian", beta = coef_real)

abess_fit = abess(
  data$x,
  data$y,
  family = "gaussian",
  beta.low = 0  # nonnegativity via truncation; the intercept is not affected
)

true_support_set = which(coef_real != 0)
est_support_set = which(coef(abess_fit, support.size = abess_fit[["best.size"]], sparse = FALSE)[-1] != 0)

all(true_support_set == est_support_set)

Thank you for your valuable feedback that contributed to this enhancement!

tomwenseleers commented 1 year ago

Many thanks for this! Today I was testing with the latest version on CRAN (abess_0.4.8), but I am still running into a couple of problems - here is a multitask learning example:

n = 500
p = 1000
y.dim = 10 # nr of tasks/channels
k = 100 # features with nonzero coefs

coef_real = matrix(0, nrow=p, ncol=y.dim)
nonzero_features = sample(1:p, k, replace = FALSE)
coef_real[nonzero_features,] = pmax(rnorm(k*y.dim, mean=1, sd=1), 0)

Here simulating multitask case with Gaussian noise, but also interested in multitask case with Poisson noise (& identity link), which I would like to deal with by using observation weight matrix for each outcome weight=1/(data$y+0.1) as approx. 1/variance weights for the Poisson case - it seems right now it only allows weight to be a vector though :

data = generate.data(n, p, support.size = k, 
                      family = "mgaussian", 
                      beta = coef_real,
                      y.dim = y.dim)

abess_fit = abess(
  x=data$x, 
  y=data$y,
  # weight=1/(data$y+0.1), # I'd like to use this to approximate identity link Poisson case - now gives error "Error in weight.glm(para) : is.vector(para$weight) is not TRUE"
  family="mgaussian",
  beta.low=0, # nonnegativity constraints
  fit.intercept=FALSE,
  tune.path="sequence",
  tune.type="aic"
)

If I add weight=1/(data$y+0.1), which I'd like to use to approximate the case with Poisson noise (nonnegative identity-link Poisson, which is currently not supported), I get the error "Error in weight.glm(para) : is.vector(para$weight) is not TRUE". I.e., it doesn't seem to support a matrix as input at the moment (contrary to what was suggested above).
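As a possible crude interim workaround (only a sketch, and it gives up the shared support across tasks that family="mgaussian" enforces), each channel could be fitted separately with its own approximate 1/variance weight vector:

# Fit each channel separately with per-channel weights (sketch only; this loses
# the joint row-wise selection of the multitask model).
W <- 1/(data$y + 0.1)
beta_perchannel <- sapply(seq_len(ncol(data$y)), function(j) {
  fit_j <- abess(x = data$x, y = data$y[, j],
                 family = "gaussian",
                 weight = W[, j],
                 beta.low = 0,
                 fit.intercept = FALSE,
                 tune.type = "aic")
  coef(fit_j, support.size = fit_j[["best.size"]], sparse = FALSE)[-1]
})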

coef_abess = coef(abess_fit, 
                  support.size = abess_fit[["best.size"]], 
                  sparse = FALSE)[[1]][-1,]

Here I noticed that extract(abess_fit) gives an error when fit.intercept=FALSE:

# Error in h(simpleError(msg, call)) : 
# error in evaluating the argument 'x' in selecting a method for function 'as.matrix': incorrect number of dimensions

This looks like a bug perhaps...

Any advice welcome! Especially on whether you might allow weight to be a matrix rather than a vector, which would also let me approximately fit nonnegative identity-link Poisson models. Alternatively, I suppose it would not be possible to allow all standard R GLM families and link functions, e.g. also family=poisson(identity)?
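For reference, the base-R interface I have in mind looks like this for a single response (a small self-contained sketch with made-up data; positive starting values keep the identity-link Poisson means positive):

# Identity-link Poisson in base R (single response, no intercept):
set.seed(1)
n_ex <- 200
X_ex <- matrix(runif(n_ex * 3), n_ex, 3)
y_ex <- rpois(n_ex, lambda = drop(X_ex %*% c(2, 1, 5)))
fit_ex <- glm(y_ex ~ X_ex - 1,
              family = poisson(link = "identity"),
              start = rep(1, ncol(X_ex)))  # positive start so all fitted means > 0
coef(fit_ex)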

image(x=1:p, y=1:y.dim, z=coef_real, col = topo.colors(255), useRaster=TRUE) 
abline(v=(1:p)[as.vector(rowMax(coef_abess)!=0)], col="red")

table(rowMeans(coef_abess)!=0, 
      rowMeans(coef_real)!=0,
      dnn=c("estimated coefs nonzero", "true coefs nonzero"))
#                         true coefs nonzero
# estimated coefs nonzero FALSE TRUE
#                   FALSE   900   60
#                    TRUE     0     40

bbayukari commented 1 year ago

Thanks for your feedback! First, there is indeed an old bug in extract(), and we will address it promptly.

Regarding your two suggestions - allowing weights to be a matrix, and specifying exponential-family distributions and link functions in a manner similar to glm - I think they are excellent ideas. However, unfortunately, we do not have immediate plans to implement these features in the short term. When we initially designed the program, we did not anticipate the need for comprehensive coverage of all models and their variations. Consequently, even models with only subtle differences from those already implemented may not be easily accommodated by straightforward modifications to the original codebase. This means we cannot indiscriminately add models to the codebase.

There is some documentation on how developers can implement new GLM models: users simply need to inherit from the _abessGLM class and implement the probability distribution function, link function, and the gradients and Hessians for optimization. Of course, this still comes with a significant learning curve. So, I recommend checking out another project from our team called skscope, which might be helpful for your needs.

skscope treats variable selection as an optimization problem with a sparsity constraint, specifically $\min_x L(x) \;\text{s.t.}\; \|x\|_0 \leq s$. Users only need to implement the loss function (typically the negative log-likelihood of the model) to achieve variable selection for any model. In fact, skscope is an optimizer that implements various algorithms, including splicing (the algorithm used by abess), to obtain sparse optimal solutions, even though the sparsity constraint makes this a non-convex optimization problem.

It's worth noting that skscope is built on Google's JAX project, so it only has a Python API. Additionally, skscope relies on JAX's linear algebra operations, which, although faster than plain Python thanks to JIT compilation, are still far slower than the Eigen library used in abess. If you can accept these limitations, here is some Python code for multi-task linear regression with Gaussian noise, no intercept, non-negative constraints, and matrix weights using skscope. This code should run smoothly after installing the dependencies via pip or conda:

pip install skscope # or conda install skscope
pip install abess   # for creating simulated data
pip install pandas # for print results

First, simulate the multitask case with Gaussian noise.

import skscope
from skscope import ScopeSolver
import abess
import numpy as np
import jax.numpy as jnp
import pandas as pd

np.random.seed(2023)
n = 500
p = 1000
y_dim = 10  
k = 100  

coef_real = np.zeros((p, y_dim))
nonzero_features = np.random.choice(np.arange(1, p + 1), k, replace=False)
coef_real[nonzero_features - 1, :] = np.maximum(
    np.random.normal(1, 1, size=(k, y_dim)), 0
)

data = abess.make_multivariate_glm_data(
    n, p, k, family="multigaussian", coef_=coef_real, M=y_dim
)

weight = 1 #/(data.y+0.1) ## matrix weights default to 1

Then, run skscope to solve the problem:

# ScopeSolver uses a similar algorithm to abess; other solvers may be useful too.
solver = ScopeSolver( 
    dimensionality=p * y_dim,
    sparsity=range(2*k), # too slow to run from 0 to p
    sample_size=n,
    group=[i for i in range(p) for _ in range(y_dim)],
    ic_type="sic",
)
# Key is the objective function, which is squared error loss with weight matrix
solver.solve(
    objective=lambda params: jnp.sum(jnp.square(weight * (data.x @ params.reshape((p, y_dim)) - data.y))),
    layers=[skscope.layer.NonNegative(dimensionality = p * y_dim)], # non-negative constraint
    jit=True,
)
coef_est = solver.get_estimated_params().reshape(p, y_dim)

Finally, print the results.

comparison_data = {
    "estimated coefs nonzero": np.sum(coef_est != 0, axis=1) != 0,
    "true coefs nonzero": np.sum(coef_real != 0, axis=1) != 0
}

comparison_df = pd.DataFrame(comparison_data)
table = pd.crosstab(comparison_df["estimated coefs nonzero"], comparison_df["true coefs nonzero"])
print(table)

Furthermore, I attempted to write a Poisson regression with an identity link, but the logarithmic term in the loss function necessitates a good initial value for optimization in each subproblem to avoid taking the logarithm of a negative number. If there's a way to easily specify good initial values, this code might be completed quickly.
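To make that point concrete, this is the loss in question written out in R for illustration (the skscope version would be the JAX/Python analogue of the same expression):

# Identity-link Poisson negative log-likelihood (up to a constant). The log(mu)
# term is only defined when every fitted mean mu = X %*% beta is positive,
# which is why each subproblem needs a feasible (positive-mean) starting value.
nll_pois_identity <- function(beta, X, y) {
  mu <- drop(X %*% beta)
  if (any(mu <= 0)) return(Inf)  # infeasible point: log of a non-positive value
  sum(mu - y * log(mu))
}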

Once again, thank you for your friendly and helpful feedback!

tomwenseleers commented 1 year ago

Many thanks for this - that's very helpful! I was just sort of hoping that allowing the observation weights to be a matrix instead of a vector might be a small change in the code - but of course I fully understand if it's not & that implementing this change would be too much work right now!