dswah / pyGAM

[HELP REQUESTED] Generalized Additive Models in Python
https://pygam.readthedocs.io
Apache License 2.0
856 stars 156 forks source link

GAMLSS #154

Open xi2pi opened 6 years ago

xi2pi commented 6 years ago

I tried to compute similar percentile curves like I would get with the gamlss package in R (gamlss.com). In this case, we only have one independent and one dependent variable. An example plot is shown here: http://www.gamlss.com/version-4-1-2/

I was not successful with GAM(). LinearGAM() seems to be good for that, however it is only applicable for normal distribution. Is there a possiblity to extend the functionality?

dswah commented 6 years ago

@xi2pi yes. to find prediction intervals for other distributions you should try the empirical sampling approach.

gam.fit(X, y) # your fitted model
samples = gam.sample(X, y, quantity='y', n_draws=500, sample_at_X=X)
# sampels is shape (500, len(y))

percentiles = np.percentile(samples, q=[2.5, 97.5], axis=0)
#percentiles is now shape (2, len(y))

and now you have an empirical distribution for your dependent/response variable conditioned on your independent variable.

Note: you should tune the n_draws to suit your precision needs and your tolerance for data size.

i would like to clarify that if you are looking instead for confidence intervals, then you can do that using the following:

confi = gam.confidence_intervals(X, width=.95)

and this works for all of the distributions.

dswah commented 6 years ago

please let me know how this works for you!

dswah commented 6 years ago

@conduit242 unfortunately, no. we need to have a fitted GAM first. we are drawing the samples from p(y|X) so we need to have fitted GAM coefficients that tell us how to get mu = E[y] from X. also some distributions (like the laplace) require a scale parameter that we also need to estimate.

conduit242 commented 6 years ago

Ack, I didn't read the original issue clearly enough, my question makes no sense at all! I'm gonna delete that to remove noise.

xi2pi commented 6 years ago

@dswah, thanks for the answer! Getting close to a solution but still have some issues.

Code: pygam example code

Figure: pygam example figure

Problem is that the lines are not straight. I tried to adjust n_draws but without success.

It would also be great if I got similar results like with the GAMLSS package in R. I guess that is not easy as pygam seems to have another approach though.

dswah commented 6 years ago

@xi2pi ahh i think the problem is mostly due to the fact that your X data are discrete, and thus you are generating multiple samples for many of your X values.

You should consider changing the argument sample_at_X to be a grid of uniformly sampled X values like the following:

from pygam.utils import generate_X_grid
XX = generate_X_grid(gam, n=100)

i tried increasing the n_draws to around 50,000 and got pretty smooth lines. of course the intervals generated using this method will be slightly noisy because they are based on estimated percentiles, which are random variables.

you can also play with changing the resolution the resolution of the X grid, but this directly impacts the sampling effort.

the result is not too noisy:

from pygam import LinearGAM
from pygam.utils import generate_X_grid

gam = LinearGAM(n_splines=10).fit(X, y) # your fitted model

# change resolution of X grid
XX = generate_X_grid(gam, n=100)
samples = gam.sample(X, y, quantity='y', n_draws=50000, sample_at_X=XX)

# sampels is shape (len(XX), len(y))

percentiles = np.percentile(samples, q=[2.5, 97.5], axis=0)
#percentiles is now shape (2, len(y))

plt.figure(figsize=(10,8))
plt.scatter(X, y)
plt.plot(XX, percentiles[0])
plt.plot(XX, percentiles[1])
plt.savefig("pygam_example.png")
plt.show()

pygam_example

Since this is a linear model, you can compare these intervals to the ones generated deterministically using:

XX = generate_X_grid(gam, n=100)

plt.figure(figsize=(10,8))
plt.scatter(X,y)
plt.plot(XX, gam.prediction_intervals(XX, quantiles=[.025, .975]))
plt.savefig("pygam_example.png")
plt.show()

pygam_example

NB i tried to create data similar to yours:

# i see that your data are discrete in both X and y. 
# i created sample data that is similar to it.
import numpy as np

N = 75

t = np.linspace(1, 2, N)
density = np.cos(t * 2 * np.pi) ** 2 * 10

n_samples = np.random.poisson(density, N) + 2

X = []
for n, x in zip(n_samples, np.linspace(1, 2, N)):
    X += n * [x]

X = np.array(X)
y = np.random.poisson(X*10)

ps sorry for the slow response, i was camping this weekend.

xi2pi commented 6 years ago

@dswah great, that helped a lot! I tried the code with some more data: pygam_example_3

Python code The result looks really great! Thanks for this great package. BTW, you can find all data, code and figures in my repository: gamlss repository

Now, back to my question about the distribution. Is it possible to add another distribution to pyGAM? I would like to use the Box-Cox Cole Green distribution.

Some background information about my application field: I work on pediatric reference curves (Wikipedia example) and we use R/gamlss and the LMS method by Cole for the computation (it is recommended by the World Health Organization). We also developed a little Python program for physicians and scientist. Problem is that it uses R as subprocess. I woul like to make it an all-in-Python solution. To sum it up, it would be great if we could receive similar or approximately the same result with pyGAM like we do with gamlss. Do you think there is a chance that this might work?

dswah commented 6 years ago

@xi2pi that looks awesome!!

it is easy to add distributions to pyGAM, but unfortunately this one cannot be be solved with pyGAM since it requires making smooths for the mean, scale, and skew.

pyGAM can only smooth the mean.

however, we can certainly make arbitrary centile curves for your data, like the ones you showed in the wikipedia article.

we can achieve this by adding the Asymmetric Laplacian distribution to pyGAM. this distribution can regress arbitrary percentiles.

I was working on adding the Laplacian distribution to pyGAM for robust regression, and adding the asymmetric version would be trivial.

dswah commented 6 years ago

hmm it might not be possible to add the Laplace distribution as it is not a member of the exponential family...

we'll see.

xi2pi commented 6 years ago

Thanks! btw, I opened the same issue in the forum for the statsmodel package

Also, my quick solution with maximum likelihood and polynoms for median, scale and skew seems to work.

Let's see!

xi2pi commented 5 years ago

@dswah: sorry, that this question is a bit out of context but it might be a profit for this repository later: as already mentioned, the LMS method proposed by Cole in 1992 is widely used for medical purposes to create reference curves (2000 citations). As there is no method for that in Python, I started implementing this approach - however my inner iteration loop fails to converge.

I posted the question on stackexchange: Generalized Additive Models: How to fit models with the LMS method by Cole?.

Also, I uploaded a script here: gamlss_BCCG_Cole.py

Do you maybe have an idea how to solve that problem?

dswah commented 5 years ago

@xi2pi im not familiar at all with this algorithm... but from running your code, it looks like the estimate for L (lambda?) diverges. it grows larger at every iteration. have you double-checked the signs in your update equations?

xi2pi commented 5 years ago

thanks for taking a look. Yes, L, which is equal to lambda, increases every iteration step. I double-checked almost everything (signs, derivatives, starting values, etc.). Only thing I did not check are the second order derivatives (W) but I trust Cole that they are correct in the publication.

I might have misunderstood the inner loop that should look like a basic back fitting algorithm image

xi2pi commented 5 years ago

ok, a classical typing error - my bad. There was a mistake in the main equation (9) in the inner loop. I have honestly work quite a while on it, before I gave up and posted it today. The fitting works now - a bit slow, but acceleration is another chapter: gamlss_BCCG_Cole.py

Thanks for the help!

Results look like that:

image