dswah / pyGAM

[HELP REQUESTED] Generalized Additive Models in Python
https://pygam.readthedocs.io
Apache License 2.0

scikit-sparse installed but not detected? #209

Closed echo66 closed 5 years ago

echo66 commented 5 years ago

I'm trying to use LogisticGAM with a really sparse pandas dataframe. But I'm getting this warning:

```
/home/echo66/.local/share/virtualenvs/pygam-505CBPMV/lib/python3.6/site-packages/pygam/utils.py:78: UserWarning: Could not import Scikit-Sparse or Suite-Sparse.
This will slow down optimization for models with monotonicity/convexity penalties and many splines.
See installation instructions for installing Scikit-Sparse and Suite-Sparse via Conda.
  warnings.warn(msg)
```
dswah commented 5 years ago

@echo66 hmm that's annoying...

the import for scikit-sparse also checks for nose. do you have that installed too? (if not, you can definitely install with conda, maybe pip too)
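a quick way to check whether the sparse backend is importable is to try the import directly (a sketch; pyGAM's actual check in pygam/utils.py may differ in detail):

```python
# Sketch: pyGAM's fast path for constrained models needs scikit-sparse
# (package "sksparse", built on SuiteSparse/CHOLMOD). If this import
# fails, pyGAM falls back to slower dense routines and emits the warning.
def have_sksparse():
    try:
        from sksparse.cholmod import cholesky  # noqa: F401
        return True
    except ImportError:
        return False

print(have_sksparse())
```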

either way, pyGAM should be quite fast on sparse problems even without scikit sparse (?)

please let me know, dani

echo66 commented 5 years ago

Thanks! That solved the issue (I'm using pipenv).

Regarding speed, I'm using a dataset with shape (10000, 18). After applying one hot encoding, I get a shape of (10000, 638). I started the fitting procedure 10 minutes ago and I'm still waiting for it to finish.

My CPU is a quadcore i7-7700HQ, with 16 GB RAM.
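for context, the column blow-up from one-hot encoding is easy to reproduce on toy data (hypothetical columns below, not the original dataset):

```python
import pandas as pd

# Toy illustration: a few high-cardinality categorical columns inflate
# a narrow frame into hundreds of dummy columns after one-hot encoding.
df = pd.DataFrame({
    "cat_a": [f"a{i % 50}" for i in range(1000)],  # 50 categories
    "cat_b": [f"b{i % 30}" for i in range(1000)],  # 30 categories
    "num":   range(1000),
})
encoded = pd.get_dummies(df, columns=["cat_a", "cat_b"])
print(df.shape, encoded.shape)  # 3 columns become 1 + 50 + 30 = 81
```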

dswah commented 5 years ago

mmm i see. that is quite a big matrix.

are you working on a linear problem, or something more exotic? also, are you using any constraints?

since pyGAM uses P-IRLS to do MLE, the optimization is usually dominated by the QR factorization in every iteration, except when using constraints, in which case the (scikit-sparse) Cholesky factorization begins to matter.

There exists a library for solving sparse QR factorization problems, but in my limited experiments I found no benefit :/

@echo66 please let me know how long it takes, and if it works etc

-dani
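a rough sketch of the per-iteration cost mentioned above: a reduced QR factorization of an m-by-n design matrix costs on the order of m * n^2 floating-point operations, so the column count matters much more than the row count (toy NumPy example, not pyGAM internals):

```python
import numpy as np

# Reduced QR of a tall matrix: cost scales roughly like m * n**2,
# so doubling the number of columns ~quadruples the work per solve.
m, n = 5000, 400
X = np.random.default_rng(0).standard_normal((m, n))
Q, R = np.linalg.qr(X)  # reduced QR: Q is (m, n), R is (n, n)

# sanity check: Q @ R reconstructs X
assert np.allclose(Q @ R, X)
```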

echo66 commented 5 years ago

By linear problem, what are you referring to, exactly? I'm not using any constraints.

dswah commented 5 years ago

@echo66 by linear i mean normal distribution with identity link function.

this type of model only requires one iteration of the optimizer, while any other will require more than one
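to make the one-iteration point concrete, here is a minimal IRLS loop for a plain logistic regression (a NumPy sketch, not pyGAM's actual P-IRLS). with an identity link and Gaussian errors the weighted solve below would already be the final answer after one pass; the logistic link forces repeated re-weighting until the coefficients settle:

```python
import numpy as np

# IRLS for logistic regression on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
p_true = 1 / (1 + np.exp(-(0.5 + X[:, 1])))
y = (rng.random(200) < p_true).astype(float)

beta = np.zeros(2)
for it in range(1, 50):
    eta = X @ beta
    mu = 1 / (1 + np.exp(-eta))          # inverse logit link
    W = mu * (1 - mu)                    # IRLS weights
    z = eta + (y - mu) / W               # working response
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

print(it, beta)  # converges after several iterations, not one
```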

dswah commented 5 years ago

@echo66 did it converge?

and which type of model are you using?

dswah commented 5 years ago

ah, whoops i see that you are using a logistic gam. i misread your first message. sorry about that.

echo66 commented 5 years ago

Regarding the distribution, I didn't change it with anything like a Box-Cox transformation. I was just trying to experiment with pyGAM in the same way people use logistic regression in scikit-learn. By the way, most of the features are binary because of OHE.

Do you think that I should apply restrictions to make things faster?

dswah commented 5 years ago

i see. no, adding restrictions will not make things faster; instead they add another, possibly greater, computational burden.

did your model converge?

your question raises an important issue for pyGAM regarding the treatment of categorical variables. the current, lazy implementation fits a coefficient for every category, when really you only need n-1 coefficients for a variable with n categories.

this approach poses statistical problems, since the coefficient values are harder to decipher, but also computational problems!

for example you mentioned that your model has mostly binary categories, which means we should need only half as many model parameters as pyGAM currently allocates.

and since a QR factorization costs ~O(m*n^2) for an m-by-n matrix, we would get a ~4x speedup on your model ...
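the n-1 encoding described above is available in pandas via `drop_first=True` (a toy sketch with made-up binary columns, to show the column count halving):

```python
import pandas as pd

# Ten binary categorical features, 100 rows of made-up data.
df = pd.DataFrame({f"flag{i}": ["yes", "no"] * 50 for i in range(10)})

full = pd.get_dummies(df)                      # 2 dummy columns per feature
reduced = pd.get_dummies(df, drop_first=True)  # 1 dummy column per feature

print(full.shape[1], reduced.shape[1])  # half the columns -> ~4x cheaper QR
```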

echo66 commented 5 years ago

I think in one of the versions I was able to get the outputs, but I don't remember if it converged. Sorry I can't help you with more information; I'm having some issues with my Python environment today.

dswah commented 5 years ago

@echo66

that's ok, i'm just curious about that and am always looking to improve pyGAM.

perhaps we should close this issue and continue the conversation as needed.

dani