Quantco / glum

High performance Python GLMs with all the features!
https://glum.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Tweedie distribution doesn't fit #716

Closed: juansecal closed this issue 6 months ago

juansecal commented 12 months ago

The Tweedie regression fails with ValueError: array must not contain infs or NaNs

There are no infinite or NaN values in the data.


from glum import GeneralizedLinearRegressor, TweedieDistribution

TweedieDist = TweedieDistribution(1.5)

f_glm1 = GeneralizedLinearRegressor(
    family=TweedieDist, alpha_search=True, l1_ratio=1.5, fit_intercept=True, max_iter=200
)

jtilly commented 12 months ago

Could you please give a self-contained, reproducible example of the problem? If that's not possible, then please at least show us the full stack trace with the error.

I would expect your code snippet to error with

File ~/glum/src/glum/_glm.py:2815, in GeneralizedLinearRegressor._validate_hyperparameters(self)
   2803         raise ValueError(
   2804             "Penalty term must be a non-negative number;"
   2805             " got (alpha={})".format(self.alpha)
   2806         )
   2808 if (
   2809     not np.isscalar(self.l1_ratio)
   2810     # check for numeric, i.e. not a string
   (...)
   2813     or self.l1_ratio > 1
   2814 ):
-> 2815     raise ValueError(
   2816         "l1_ratio must be a number in interval [0, 1];"
   2817         " got (l1_ratio={})".format(self.l1_ratio)
   2818     )
   2819 super()._validate_hyperparameters()

ValueError: l1_ratio must be a number in interval [0, 1]; got (l1_ratio=1.5)
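To make the expected failure concrete, here is a minimal standalone sketch of a range check like the one in the traceback above. The helper name check_l1_ratio is hypothetical and is not part of glum; it only mirrors the visible condition (a numeric scalar in [0, 1]) and reuses the error message shown, without trying to reproduce the lines the traceback elides.

```python
from numbers import Real


def check_l1_ratio(l1_ratio):
    """Hypothetical stand-in for the l1_ratio range check (not glum's code)."""
    # Reject non-numeric values (e.g. strings) and anything outside [0, 1].
    # NaN also fails the chained comparison, so it is rejected as well.
    if not isinstance(l1_ratio, Real) or not 0 <= l1_ratio <= 1:
        raise ValueError(
            "l1_ratio must be a number in interval [0, 1];"
            " got (l1_ratio={})".format(l1_ratio)
        )
    return l1_ratio


# A valid elastic net mixing value passes through unchanged.
check_l1_ratio(0.5)

# The value from the original snippet is rejected before any fitting starts.
try:
    check_l1_ratio(1.5)
except ValueError as err:
    print(err)
```

With l1_ratio=1.5 a check of this kind fails at hyperparameter validation, which is why the infs/NaNs error reported above is surprising for the snippet as posted.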

Here's an example of how to fit a Tweedie model (with alpha_search=True) using the data shown in the README:

from sklearn.datasets import fetch_openml
from glum import GeneralizedLinearRegressor
from glum import TweedieDistribution

# This dataset contains house sale prices for King County, which includes
# Seattle. It includes homes sold between May 2014 and May 2015.
house_data = fetch_openml(name="house_sales", version=3, as_frame=True)

X = house_data.data[
    [
        "bedrooms",
        "bathrooms",
        "sqft_living",
        "floors",
        "waterfront",
        "view",
        "condition",
        "grade",
        "yr_built",
        "yr_renovated",
    ]
]

y = house_data.target

model = GeneralizedLinearRegressor(
    family=TweedieDistribution(1.5),
    alpha_search=True,
    l1_ratio=0.5,
    fit_intercept=True,
    max_iter=200,
)

model.fit(X=X, y=y)

juansecal commented 12 months ago

Sure. The data is just losses and exposure, a classic GLM fit, with no NA or Inf values. The frequency is 3%.

Coordinate descent did not converge. You might want to increase the number of iterations. Minimum norm subgradient: nan, tolerance: nan
  newcoef, gap, , _, n_cycles = enet_coordinate_descent_gram(
Traceback (most recent call last):
  File "C:\Users\jcalderon\AppData\Local\JetBrains\PyCharm Community Edition 2023.2.1\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "", line 1, in
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\glum\_glm.py", line 3000, in fit
    coef = self._solve_regularization_path(
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\glum\_glm.py", line 1099, in _solve_regularization_path
    coef = self._solve(
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\glum\_glm.py", line 1034, in _solve
    coef, self.niter, self._ncycles, self.diagnostics = _irls_solver(
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\glum\_solvers.py", line 325, in _irls_solver
    ) = line_search(state, data, d)
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\glum\_solvers.py", line 36, in inner_fct
    out = fct(*args, **kwargs)
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\glum\_solvers.py", line 769, in line_search
    P1wd_1 = linalg.norm(data.P1 * (state.coef + d)[data.intercept_offset :], ord=1)
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\scipy\linalg\_misc.py", line 146, in norm
    a = np.asarray_chkfinite(a)
  File "C:\Users\jcalderon\PycharmProjects\GLM\venv\lib\site-packages\numpy\lib\function_base.py", line 630, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs
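Since the traceback shows the coefficients turning non-finite during the line search, one inexpensive first step is to confirm the claim about the inputs programmatically. The following is only a sketch under assumptions: the helper summarize_inputs and the toy values are made up for illustration, and the column names losses and exposure are taken from the description above rather than from any real data. It counts non-finite entries, negative losses (a Tweedie distribution with power between 1 and 2 is supported on non-negative values), and non-positive exposures (a zero exposure would produce inf or NaN if the target is built as losses divided by exposure).

```python
import math


def summarize_inputs(losses, exposure):
    """Count input problems that could plausibly lead to NaNs during fitting."""
    report = {"non_finite": 0, "negative_loss": 0, "non_positive_exposure": 0}
    for loss, expo in zip(losses, exposure):
        # NaN and +/-inf in either column are counted once per row.
        if not (math.isfinite(loss) and math.isfinite(expo)):
            report["non_finite"] += 1
            continue
        if loss < 0:
            report["negative_loss"] += 1
        if expo <= 0:
            report["non_positive_exposure"] += 1
    return report


# Toy values, made up for illustration only.
losses = [0.0, 120.5, float("nan"), -3.0, 40.0]
exposure = [1.0, 0.5, 1.0, 1.0, 0.0]
print(summarize_inputs(losses, exposure))
```

A report of all zeros would not rule out numerical trouble inside the solver, but it would narrow the search to the model setup rather than the data.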

jtilly commented 12 months ago

Thanks! Based on that output alone, it's difficult for me to tell what's going wrong. Sorry! Also, as mentioned above, if you're really running this with l1_ratio=1.5, I would expect you to hit a different error.

Are you using a private data set or are you testing this against, e.g., the publicly available "French Motor TPL Insurance Claims Data" (which we also use in our benchmark suite)?