bbalasub1 / glmnet_python

GNU General Public License v3.0

Problem with the `pmax` option and the `gaussian` family... #40

Open michaelcoconnor opened 5 years ago

michaelcoconnor commented 5 years ago

It seems that if I set pmax=n, where n is any integer, then I don't get past the following lines in glmnet.py (around line 315):

`nx = options['pmax']`

`if len(nx) == 0: nx = min(ne*2 + 20, nvars)`

Of course, if pmax is an integer it doesn't have a length, so `len(nx)` raises a TypeError; that seems to be the problem. It appears to originate in glmnetSet.py, where the default value of pmax is set to `scipy.empty([0])`, which has a length of zero.
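A minimal sketch of the failure and of one more tolerant check, for illustration only. The helper name `resolve_pmax` is hypothetical and not part of glmnet_python, and `numpy.empty` stands in for `scipy.empty`; the fallback expression is the one quoted above.

```python
import numpy as np

# Calling len() on a plain integer raises TypeError, which is the
# failure described above when pmax is passed as an int.
try:
    len(8)
except TypeError as exc:
    message = str(exc)

def resolve_pmax(pmax, ne, nvars):
    """Hypothetical helper: turn a user-supplied pmax into an integer nx."""
    # None or an empty array means "not set": use the default from glmnet.py.
    if pmax is None or np.size(pmax) == 0:
        return min(ne * 2 + 20, nvars)
    # Accept a Python int, a NumPy scalar, or a one-element array.
    return int(np.asarray(pmax).ravel()[0])

default_nx = resolve_pmax(np.empty([0]), ne=11, nvars=10)  # unset pmax
explicit_nx = resolve_pmax(8, ne=11, nvars=10)             # integer pmax
```

`np.size` is used instead of `len` because it accepts scalars as well as arrays. This only addresses the type error, not the library warnings reported further down.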

Upon encountering this I attempted a fix by replacing scipy.empty([0]) in glmnetSet.py with None and revising the code at about line 315 of glmnet.py to:

`nx = options['pmax']`

`if nx is None: nx = min(ne*2 + 20, nvars)`

Then if I do a run with pmax=nvars, everything is fine. However, if I set pmax<nvars, say 8 instead of 10, I get `Warning: Non-fatal error in glmnet library call...` with error codes that vary as I change pmax.

I have traced where nx is submitted to the Fortran code but don't see anything that could cause an error (but I'm no expert about any of this).
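For reading the varying error codes, here is a rough sketch of a decoder. It assumes the error-flag convention described in the comments of the glmnet Fortran source: 0 means no error, positive values are fatal, `-k` means convergence was not reached at the k-th lambda, and `-10000-k` means the number of nonzero coefficients along the path exceeded nx at the k-th lambda. The function name is hypothetical, and the convention should be checked against the Fortran source actually bundled with the project.

```python
def describe_glmnet_jerr(jerr):
    """Hypothetical helper: describe a glmnet Fortran error flag (assumed convention)."""
    if jerr == 0:
        return "no error"
    if jerr > 0:
        return "fatal error %d" % jerr
    if jerr <= -10000:
        # Assumed form: jerr = -10000 - k
        k = -jerr - 10000
        return "nonzero coefficients along path exceed nx (pmax) at lambda index %d" % k
    # Assumed form: jerr = -k
    return "convergence not reached at lambda index %d" % -jerr

example = describe_glmnet_jerr(-10005)
```

If that convention holds, codes near -10000 that shift as pmax changes would be consistent with the path needing more than pmax nonzero coefficients at some lambda, rather than with a bug in how nx is passed.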

So then it occurred to me that, like the participants in this matter, I found the actual meaning of dfmax and pmax to be obscure... thanks in no small measure to the indefinite wording of this. So I tried setting pmax=None (in my modified code) and dfmax=n, varying n. No errors were encountered, but if n was set to, say, 2, the number of non-zero betas was unaffected and exceeded 2. So I'm at a loss as to how to proceed in order to realize the promise of dfmax and pmax. And I don't know if my fix of the integer problem is really OK.
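One possible workaround, sketched under assumptions: if dfmax acts as a stopping criterion along the lambda path rather than a hard cap (not confirmed in this thread), the model size can be limited after fitting by picking the last lambda whose coefficient column has at most the desired number of nonzero entries. The helper name and the toy `beta_path` array are hypothetical; the nvars x nlambda layout is assumed.

```python
import numpy as np

def last_index_within_dfmax(beta_path, dfmax):
    """Hypothetical helper: index of the last column with <= dfmax nonzeros, else None."""
    # Number of nonzero coefficients at each lambda (one column per lambda).
    df = np.count_nonzero(beta_path, axis=0)
    ok = np.flatnonzero(df <= dfmax)
    return int(ok[-1]) if ok.size else None

# Toy coefficient path: 3 variables (rows) by 4 lambda values (columns).
beta_path = np.array([[0.0, 0.5, 0.7, 0.9],
                      [0.0, 0.0, 0.2, 0.4],
                      [0.0, 0.0, 0.0, 0.1]])

idx = last_index_within_dfmax(beta_path, 2)
selected = beta_path[:, idx]
```

This selects from whatever path the fit returned, so it sidesteps the option semantics instead of explaining them.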

michaelcoconnor commented 5 years ago

I have just found some additional information on pmax and dfmax. Scroll down about halfway to the elnet call arguments. There, as in this project's code, pmax is nx internally and dfmax is ne.

michaelcoconnor commented 5 years ago

I cranked up the competing python-glmnet project's code, running the gaussian (aka linear) family, and modified it to permit a pmax entry (it already has a max_features option, which is the same as dfmax).

The results were exactly the same. Without setting those options the coefficients produced are the same as with the present project, and the errors upon attempting to use pmax are about the same.