abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/

[Bug] No termination within reasonable time for Poisson regression in a specific case #504

Open brtang63 opened 1 year ago

brtang63 commented 1 year ago

I've encountered a strange issue: abess() does not terminate in a specific situation. The following code is a reproducible example: it runs for at least 10 minutes without terminating. However, simply setting support.size = 0:13 or support.size = 14 makes it terminate almost immediately (within about 1 second). Moreover, with tune.type = "gic" the issue does not occur either, which really confuses me.

The version of abess is 0.4.7 (installed from CRAN). I've tested the code on two different Linux systems and encountered the same issue on both.

library(abess)
seed <- 1
n <- 100
p <- 1000
family <- "poisson"
snr <- Inf
beta <- rep(0, p)
nonzero <- sample(1:p, 10)
beta[nonzero] <- c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5)
k <- 10

data <- generate.data(n, p, beta = beta, snr = snr, family = family, support.size = k, seed = seed)
x <- data$x
y <- data$y

abess(x, y, tune.type = "cv", family = "poisson", support.size = 0:14)

Mamba413 commented 1 year ago

Thanks. I can reproduce this on my laptop. It may be caused by the extremely large value of the deviance when setting support.size = 0:14.

> abess(x, y, tune.type = "gic", family = "poisson", support.size = 0:13)
Call:
abess.default(x = x, y = y, family = "poisson", tune.type = "gic",  support.size = 0:13)

   support.size           dev          GIC
1             0 -7.581848e+14 -1.51637e+15
2             1 -2.298525e+34 -4.59705e+34
3             2 -2.298525e+34 -4.59705e+34
4             3 -2.298525e+34 -4.59705e+34
5             4 -2.298525e+34 -4.59705e+34
6             5 -2.298525e+34 -4.59705e+34
7             6 -2.298525e+34 -4.59705e+34
8             7 -2.298525e+34 -4.59705e+34
9             8 -2.298525e+34 -4.59705e+34
10            9 -2.298525e+34 -4.59705e+34
11           10 -2.298525e+34 -4.59705e+34
12           11 -2.298525e+34 -4.59705e+34
13           12 -2.298525e+34 -4.59705e+34
14           13 -2.298525e+34 -4.59705e+34
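
To give a rough sense of why the deviance can reach such magnitudes, here is a minimal sketch, not taken from abess itself. It assumes independent standard-normal predictors (an assumption about the simulated design, not something stated in this thread) and uses the ten coefficients of 5 from the reproduction script. The helper name `linear_predictor_sd` is hypothetical.

```python
import math

def linear_predictor_sd(coefs):
    """Standard deviation of eta = sum(b_j * x_j) for independent N(0, 1) predictors."""
    return math.sqrt(sum(b * b for b in coefs))

# Ten non-zero coefficients, all equal to 5, as in the reproduction script.
coefs = [5.0] * 10
sd = linear_predictor_sd(coefs)

# The Poisson mean is exp(eta), so a linear predictor only a couple of
# standard deviations above zero already implies an enormous expected count.
for z in (1, 2, 3):
    print(f"eta = {z} sd -> Poisson mean ~ {math.exp(z * sd):.3e}")
```

Under these assumptions, the log-likelihood terms of the form y * eta - exp(eta) are dominated by exp(eta), which is one plausible way to end up with deviance values as large as those in the table.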

Mamba413 commented 1 year ago

@oooo26, I have uploaded two files, poisson_y.csv and poisson_x.csv, which correspond to y and x, respectively. Can you test whether this issue also occurs in Python?

oooo26 commented 1 year ago

Hi, sorry for the late response. I have checked in Python, and the problem does not seem to occur there.

ABESS version: latest, v0.4.6 (PyPI)
Python version: 3.9.12

Here is the test code:

import numpy as np
import pandas as pd
import abess

X = pd.read_csv("poisson_x.csv")
y = pd.read_csv("poisson_y.csv").squeeze()
print(X.shape)
print(y.shape)

model = abess.PoissonRegression(
    support_size=range(15),     # 0:14
    cv=5                        # both CV and IC are working
)
model.fit(X, y)

print(f"Sparsity: {np.count_nonzero(model.coef_)}")
print(f"Non-zero: {np.nonzero(model.coef_)[0]}")
print(f"Train Loss: {model.train_loss_}")
print(f"Test Loss: {model.eval_loss_}")
######
# Sparsity: 4
# Non-zero: [122 352 573 769]
# Train Loss: -2360540438301305.5
# Test Loss: -729389503380903.0
######

Mamba413 commented 11 months ago

@brtang63 , can you check this issue on the latest abess R package? I believe this problem has been addressed.

brtang63 commented 4 months ago

Sorry for the late reply. I've tested with the latest CRAN version, 0.4.8, and the problem still happens occasionally. Note that the example I posted previously is not a good one: the seed is set only for generate.data(), not for sample(). The following code is more reproducible. set.seed(1) works fine, but set.seed(2) still triggers the problem.

R version: 4.3.1
abess version: 0.4.8

library(abess)

set.seed(2)
n <- 100
p <- 1000
family <- "poisson"
snr <- Inf
beta <- rep(0, p)
nonzero <- sample(1:p, 10)
beta[nonzero] <- c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5)
k <- 10

data <- generate.data(n, p, beta = beta, snr = snr, family = family, support.size = k)
x <- data$x
y <- data$y

abess(x, y, tune.type = "cv", family = "poisson", support.size = 0:14)
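
For anyone reproducing this from Python, the same seeding pitfall applies: the random draw of the non-zero positions has to be seeded as well, not only the data generator. Below is a minimal sketch using only the standard library. The helper `draw_support` is hypothetical, and Python's generator differs from R's, so the indices will not match those produced by sample(1:p, 10) in R.

```python
import random

def draw_support(p, k, seed=None):
    """Draw k distinct 1-based positions out of p, optionally with a fixed seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, p + 1), k))

# Seeding the draw makes the support reproducible across runs.
first = draw_support(1000, 10, seed=2)
second = draw_support(1000, 10, seed=2)
print(first == second)

# Without a seed, the support generally changes from run to run, which is
# what made the first example in this thread hard to reproduce.
unseeded = draw_support(1000, 10)
print(unseeded)
```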

Mamba413 commented 4 months ago

@brtang63 I guess this happens because the estimated coefficients are unbounded, owing to the nature of the Poisson distribution. In the new version of the abess library, you can use beta.max and beta.min to control the range of the estimated coefficients. You may refer to this link: https://github.com/abess-team/abess/issues/510#issuecomment-1732315856
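
As a toy illustration of why bounding the coefficients helps, the sketch below fits a single Poisson slope by projected gradient ascent and clips it to an interval after every step. This is not abess's implementation: the function names, step size, and data are all made up for the example, and only the idea of a lower and upper bound mirrors beta.min and beta.max. The data are chosen so that the unconstrained maximum-likelihood estimate diverges to minus infinity.

```python
import math

def clip(value, lower, upper):
    """Project value onto the interval [lower, upper]."""
    return max(lower, min(upper, value))

def fit_poisson_slope(xs, ys, beta_min, beta_max, steps=500, lr=0.1):
    """Projected gradient ascent on the Poisson log-likelihood (one slope, no intercept)."""
    beta = 0.0
    for _ in range(steps):
        # Gradient of sum(y * x * beta - exp(x * beta)) with respect to beta.
        grad = sum(x * (y - math.exp(x * beta)) for x, y in zip(xs, ys))
        beta = clip(beta + lr * grad, beta_min, beta_max)
    return beta

# All counts are zero while every x is positive, so the likelihood keeps
# increasing as beta decreases and no finite maximiser exists.
xs = [1.0, 2.0, 3.0]
ys = [0, 0, 0]

bounded = fit_poisson_slope(xs, ys, beta_min=-2.0, beta_max=2.0)
unbounded = fit_poisson_slope(xs, ys, beta_min=float("-inf"), beta_max=float("inf"))
print(bounded, unbounded)
```

With the bounds in place the estimate settles on the lower limit, whereas the unbounded run keeps drifting for as long as it is iterated.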