civisanalytics / python-glmnet

A python port of the glmnet package for fitting generalized linear models via penalized maximum likelihood.

attribute n_lambda_ may be less than specified n_lambda, but the selected coefficients vary a lot #56

Closed muhlbach closed 5 years ago

muhlbach commented 5 years ago

I have a specific problem that is very frustrating. In my application, if I specify n_lambda=100 and check the coef_path_ attribute, it has the shape (n_samples, 5). This may be expected in some cases. However, if I specify n_lambda=5, then coef_path_ looks very different. I would expect similar behavior in both cases. Is there a good explanation?

stephen-hoover commented 5 years ago

Hello @muhlbach ! glmnet automatically determines a grid of lambda values at which to fit a model. My understanding is that it determines the maximum value from data, and the minimum value using the maximum value and the min_lambda_ratio parameter. The grid of lambda parameters is evenly spaced on a log scale between these two ends. There's an early stopping criterion which will let the code stop before reaching all points in the grid, which is why the lambda path doesn't always have length n_lambda.
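As a sketch of my understanding (the function name and default values here are illustrative, not the library's internals), the grid construction looks something like:

```python
import numpy as np

# Hypothetical sketch of how glmnet builds its lambda grid: log-spaced
# from lambda_max down to lambda_max * min_lambda_ratio.
def lambda_grid(lambda_max, min_lambda_ratio=1e-4, n_lambda=100):
    lambda_min = lambda_max * min_lambda_ratio
    return np.logspace(np.log10(lambda_max), np.log10(lambda_min), n_lambda)

grid = lambda_grid(lambda_max=10.0, n_lambda=5)
# grid runs from 10.0 down to 0.001, evenly spaced in log space
```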

When you say that coef_path_ with n_lambda=5 looks very different from coef_path_ with n_lambda=100, do you mean that the values of the coefficients at each step aren't the same, and that you don't have 5 steps when n_lambda=5? If that's the case, it sounds like what's happening is that the grid spacing is different when you select n_lambda=100 and n_lambda=5, giving you different paths for the coefficients. You can see the actual lambda values used in the lambda_path_ attribute.
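To illustrate the grid-spacing point (with made-up endpoints, not the values glmnet would actually pick for your data), a coarse and a fine log grid over the same range share only their endpoints, so the coefficients are evaluated at almost entirely different penalties:

```python
import numpy as np

# Same lambda range sampled at 5 vs. 100 points: the endpoints coincide,
# but essentially no interior grid points do.
coarse = np.logspace(1, -3, 5)    # stand-in for n_lambda=5
fine = np.logspace(1, -3, 100)    # stand-in for n_lambda=100
shared = np.intersect1d(np.round(coarse, 10), np.round(fine, 10))
# shared contains only the two endpoints, 10.0 and 0.001
```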

There's another source for run-to-run variation, which is that the cross-validation folds used for testing a model are randomly selected. You didn't mention anything about seeds, but if you aren't already, you can fix the random number generator seed with the random_state parameter. That shouldn't really affect this particular test, since the lambda grid is constructed deterministically, but in some cases it might result in small changes in the early stopping point.

What would you like to accomplish by setting n_lambda to 5? Perhaps there's another way to achieve the same goal.

muhlbach commented 5 years ago

Hi @stephen-hoover! Thank you for the answer!

In both cases (n_lambda=5 and n_lambda=100), coef_path_ has the shape (n_samples, 5), which I find strange. But even if that is correct, a different number of variables is selected. For n_lambda=100, it selects only two variables in each of the 5 columns of coef_path_. But if n_lambda=5, it selects a varying number, starting from zero and then increasing (which is what I expect).

I need this for soft thresholding: for instance, I need the "5 most significant variables" or the "10 most significant variables". Therefore, I always specify a very high n_lambda and then select the column of coef_path_ that gives me the desired number of variables. I hope that makes sense.
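Concretely, my selection step looks roughly like this (toy data; the helper name is mine, and the array stands in for coef_path_):

```python
import numpy as np

# Hypothetical helper: given a coefficient path of shape
# (n_coefficients, n_lambda_), return the index of the column whose
# number of non-zero coefficients is closest to a target k.
def column_for_k_variables(coef_path, k):
    nonzero_counts = np.count_nonzero(coef_path, axis=0)
    return int(np.argmin(np.abs(nonzero_counts - k)))

# Toy path whose columns have 0, 2, 5, and 9 non-zero coefficients.
path = np.zeros((10, 4))
path[:2, 1] = 1.0
path[:5, 2] = 1.0
path[:9, 3] = 1.0

col = column_for_k_variables(path, k=5)  # -> 2
```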

stephen-hoover commented 5 years ago

Yes, that makes sense. As you say, the number of non-zero coefficients should gradually vary between 0 and the maximum. I am surprised that n_lambda=5 gives you the expected range from 0 to n_coefficients non-zero coefficients, but n_lambda=100 selects two non-zero coefficients at each step. I would have expected both to vary from 0 to n_coefficients, especially since you see that behavior with n_lambda=5. Are you able to provide code and data that would let me reproduce this?

I have a couple of clarifying questions. First, is this a regression problem using glmnet.ElasticNet? Second, you mention that your coef_path_ has shape (n_samples, 5) -- do you mean (n_coefficients, 5)?

muhlbach commented 5 years ago

I will provide the code. Which e-mail can I send it to?

I mean number of coefficients, sorry.


stephen-hoover commented 5 years ago

Are you able to post it in this thread, or link to a gist?

muhlbach commented 5 years ago

Sure. Code and data are attached. See link (file is too large to include here): https://www.dropbox.com/s/45da8ytw0q73a9h/example.zip?dl=0

stephen-hoover commented 5 years ago

Thanks. I will investigate and get back to you.

stephen-hoover commented 5 years ago

@muhlbach , I was able to take a look at your code and data, and I think I see what's going on. The test code I got from you is

```python
# example
import numpy as np
from glmnet import ElasticNet

# Load
X = np.load('X.npy')
y = np.load('y.npy')

# Run with n_lambda=100
test1 = ElasticNet(alpha=1,
                   cut_point=1,
                   fit_intercept=False,
                   n_lambda=100,
                   standardize=False,
                   tol=1e-05,
                   max_iter=10000,
                   verbose=0)

test1.fit(X, y)
test1.coef_path_

# Run with n_lambda=5
test2 = ElasticNet(alpha=1,
                   cut_point=1,
                   fit_intercept=False,
                   n_lambda=5,
                   standardize=False,
                   tol=1e-05,
                   max_iter=10000,
                   verbose=0)

test2.fit(X, y)
test2.coef_path_
```

I see that these models are being fit without an intercept term, but the mean of the targets is approximately 2.5. glmnet does early stopping by looking at the percent of deviance explained at each step along the lambda path. It also sets the start of the path by finding the value of lambda which forces all coefficients to be zero, and without an intercept, this needs to be quite high. In this situation, without an intercept, the "percent of deviance explained" comes out greater than 1.

I don't know exactly how glmnet calculates when to stop, but apparently having a percent of deviance explained greater than 1 makes it think that it's not worth continuing along the lambda path, and it stops after 5 steps (which looks like the minimum).

This Stack Exchange question addresses the same problem you're running into: https://stats.stackexchange.com/questions/243347/why-is-cv-glmnet-returning-absurd-coefficients-when-intercept-term-is-omitted

I suggest that you either include an intercept term, or that you manually input a lambda path (using the lambda_path parameter of ElasticNet). Does this help?
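For example, a manual path could look like this (the endpoints here are made up; choose them to bracket the useful penalty range for your data):

```python
import numpy as np

# A manually specified, decreasing log-spaced lambda path. Supplying it
# directly means every fit is evaluated at the same penalties, regardless
# of early stopping.
manual_path = np.logspace(2, -4, 100)

# It would then be passed in via the lambda_path parameter, e.g.:
#   model = ElasticNet(alpha=1, fit_intercept=False, lambda_path=manual_path)
#   model.fit(X, y)
```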

muhlbach commented 5 years ago

Oh, I see. I like the explanation, thank you! However, I have to run without intercept as the coefficients are individual forecasts which I need to combine, and it doesn’t make too much sense to include an intercept. I’ll look into specifying the path.

Thanks again for your answers!


stephen-hoover commented 5 years ago

You're welcome!