Quantco / glum

High performance Python GLMs with all the features!
https://glum.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
308 stars 24 forks source link

`GeneralizedLinearRegression(warm_start=True).fit(X, y)` raises if `X` has a zero-std column #645

Closed mlondschien closed 1 year ago

mlondschien commented 1 year ago

Minimal reproducible example:

import numpy as np
from glum import GeneralizedLinearRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))
X[:, 0] = 0
beta = np.zeros(100)
beta[:10] = 1
y = X @ beta + rng.normal(size=100)

model = GeneralizedLinearRegressor(family="normal", alpha=0.1, warm_start=True)
model.fit(X[:99, :], y[:99])
model.fit(X, y)

This raises

(glum) ~/code/icu-experiments [shared-lasso] $ python repr_example.py 
/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/glum/_glm.py:407: RuntimeWarning: invalid value encountered in divide
  coef[0] += np.squeeze(col_means / col_stds).dot(coef[1:])
Traceback (most recent call last):
  File "/Users/mlondschien/code/icu-experiments/repr_example.py", line 13, in <module>
    model.fit(X, y)
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/glum/_glm.py", line 2518, in fit
    coef = self._solve(
           ^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/glum/_glm.py", line 1014, in _solve
    coef, self.n_iter_, self._n_cycles, self.diagnostics_ = _irls_solver(
                                                            ^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/glum/_solvers.py", line 315, in _irls_solver
    d, n_cycles_this_iter = inner_solver(state, data, hessian)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/glum/_solvers.py", line 36, in inner_fct
    out = fct(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/glum/_solvers.py", line 52, in _least_squares_solver
    d = linalg.solve(hessian, state.score, assume_a="pos")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/scipy/linalg/_basic.py", line 144, in solve
    b1 = atleast_1d(_asarray_validated(b, check_finite=check_finite))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/scipy/_lib/_util.py", line 252, in _asarray_validated
    a = toarray(a)
        ^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/glum/lib/python3.11/site-packages/numpy/lib/function_base.py", line 628, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

The first warning hints at the problem.

A quick fix would be something along the lines of

col_stds[col_stds == 0] = 1

as these columns don't contribute to coef[0] anyways. However, the question of how to account for floating point inaccuracies arises.

MarcAntoineSchmidtQC commented 1 year ago

Interesting. I think the solution is to use the function just above in the code you linked - _one_over_var_inf_to_val. That's how we've been dealing with this problem elsewhere in the codebase.