SheffieldML / GPy

Gaussian processes framework in python
BSD 3-Clause "New" or "Revised" License

Heteroscedastic and sparse GP #602

Closed · astrozot closed this issue 6 years ago

astrozot commented 6 years ago

I am just learning how to use GPy, and for a specific project I would like to use a heteroscedastic GP model. My project involves measurements of some parameters for a few million stars, which I need to interpolate in order to obtain a smooth map. Each measurement is associated with an error that can be taken to be normally distributed with known variance; the variances of the various measurements are all different.

So far I have been using standard interpolation techniques, such as moving averages. I would like to try something more powerful, such as Gaussian Process Regression. However, because of the size of my dataset, I clearly need to use a sparse GP method.

Is there any implementation of a sparse heteroscedastic GP model in GPy? If not, how difficult would it be to add one?

mzwiessele commented 6 years ago

I believe you can just use the HeteroscedasticGaussian for this:

import numpy as np
import GPy

# X and Y are N by 1 datasets
Z = np.random.choice(X[:, 0], 7)[:, None]  # where 7 is the number of inducing inputs
# variances are the known noise variances, one per data point
variances = np.random.uniform(0.01, 0.2, (X.shape[0], 1))
lik = GPy.likelihoods.HeteroscedasticGaussian({'output_index': np.array(0)}, variance=variances)
lik.fix()  # keep the per-point variances fixed instead of optimizing them
m = GPy.core.SparseGP(X, Y, Z,
    GPy.kern.RBF(1), lik.copy(),
    Y_metadata={'output_index': np.arange(X.shape[0])},
    inference_method=GPy.inference.latent_function_inference.VarDTC()
)
# m.optimize(messages=1)
# print(m)

Let me know how it goes, I'm curious if it gives promising results :)

astrozot commented 6 years ago

Dear Max,

thank you for your quick reply! I confess it looks like black magic to me... but it seems to work! I think I can guess what a few lines do, but others are not so clear to me (for example the need for lik.fix() or the inference_method parameter). I can only blame my ignorance.

Anyway, when I used it I found something unexpected. The method, as I noted, works nicely: compared to a standard sparse regression, for example, it performs better (which suggests the errors are really taken into account). However, the confidence intervals reported by m.plot show large discontinuities:

[Screenshot 2018-02-18 16:07: m.plot output for the heteroscedastic sparse GP, with discontinuous confidence intervals]

I am also a little confused by the m.plot command, since if I compute the error bars myself using m.predict_noiseless on a range of points I do not see any discontinuity (blue lines refer to the heteroscedastic fit, red ones to the standard GP regression):

[Screenshot 2018-02-18 16:14: error bars computed with m.predict_noiseless; blue = heteroscedastic fit, red = standard GP regression]

Again, I can only blame my ignorance, but I would appreciate any help that can shed some light on this.

Thank you, Marco

P.S. I am enclosing below the code to reproduce what I am saying.

import numpy as np
import matplotlib.pyplot as plt
import GPy

np.random.seed(2)

# Synthetic heteroscedastic data: known, point-dependent noise variances V
N = 1000
X = np.random.uniform(low=0, high=10, size=(N, 1))
V = np.random.uniform(0.1, 3.0, size=(N, 1))**2
Y = np.sin(X) + np.random.normal(size=(N, 1)) * np.sqrt(V)
plt.plot(X, Y, '.')

# Heteroscedastic sparse GP with fixed per-point noise variances
M = 100  # number of inducing points
Z = np.random.choice(X[:, 0], M)[:, None]
lik = GPy.likelihoods.HeteroscedasticGaussian({'output_index': np.array(0)}, variance=V)
lik.fix()
m = GPy.core.SparseGP(X, Y, Z,
    GPy.kern.RBF(1), lik.copy(),
    Y_metadata={'output_index': np.arange(X.shape[0])},
    inference_method=GPy.inference.latent_function_inference.VarDTC()
)
m.optimize(messages=1)

# Standard (homoscedastic) sparse GP regression for comparison
m1 = GPy.models.SparseGPRegression(X, Y, GPy.kern.RBF(1), Z)
m1.optimize(messages=1)

# Built-in plots, with the true function overlaid
x1 = np.linspace(0, 10, 1000)
fig = m.plot(plot_density=False, plot_data=False)
fig.plot(x1, np.sin(x1), 'r')
GPy.plotting.show(fig)

fig = m1.plot(plot_density=False, plot_data=False)
fig.plot(x1, np.sin(x1), 'r')
GPy.plotting.show(fig)

# Error bars computed by hand from the noiseless (latent f) predictions
p = m.predict_noiseless(x1.reshape((-1, 1)))
plt.plot(x1, p[0], 'b')
plt.plot(x1, p[0] - 3*np.sqrt(p[1]), 'b--')
plt.plot(x1, p[0] + 3*np.sqrt(p[1]), 'b--')
p1 = m1.predict_noiseless(x1.reshape((-1, 1)))
plt.plot(x1, p1[0], 'r')
plt.plot(x1, p1[0] - 3*np.sqrt(p1[1]), 'r--')
plt.plot(x1, p1[0] + 3*np.sqrt(p1[1]), 'r--')
plt.plot(x1, np.sin(x1), 'k.')

# Sum of squared errors against the true function
print(np.sum((p[0].flatten() - np.sin(x1))**2))
print(np.sum((p1[0].flatten() - np.sin(x1))**2))

mzwiessele commented 6 years ago

1: fixing the likelihood means that the noise variance parameters are not learned; they are assumed to be given.

2: when you use m.plot, you plot the posterior of y, which includes the likelihood variance. Try using m.plot_f, which plots the posterior of f, as you already did with predict_noiseless (see the short sketch after this list).

3: it seems you are using too many inducing inputs; I am certain you can get away with a lot fewer (say 10-30) in the above example.

4: you can ignore the inference method; it is chosen automatically for you at initialization. It is only there for completeness.
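
To make point 2 concrete, here is a minimal sketch reusing m and x1 from the code above (it uses only the GPy calls already mentioned in this thread):

# Posterior of the latent function f, without the likelihood variance added;
# this is what predict_noiseless computes, so it should look smooth:
fig = m.plot_f(plot_density=False, plot_data=False)
GPy.plotting.show(fig)

# The same quantities by hand:
mu_f, var_f = m.predict_noiseless(x1.reshape((-1, 1)))

# m.plot, by contrast, shows the posterior of y, i.e. it adds the per-point
# heteroscedastic likelihood variance on top of var_f, hence the jumps.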

mzwiessele commented 6 years ago

Btw, you can use any kernel that provides gradients with respect to X, i.e. one that implements the k.gradients_X method.
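
For example (just a sketch reusing X, Y, Z and lik from above; Matern32 is one of the stationary kernels that implements gradients_X):

kern = GPy.kern.Matern32(input_dim=1)  # any kernel with a gradients_X method will do
m2 = GPy.core.SparseGP(X, Y, Z, kern, lik.copy(),
    Y_metadata={'output_index': np.arange(X.shape[0])})
# inference_method is left to the default; it is chosen automatically (see point 4)
m2.optimize(messages=1)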

astrozot commented 6 years ago

Thank you again!

Regarding the number of inducing points, you are of course right: I can get away with ~15. However, if I understand it correctly, I should use a number of inducing points that can capture the complexity of the underlying function (which is not always easy to tell...). In the real application I have in mind, I expect to have to deal with hundreds of them.

As a last comment, I tried to increase the number of points, but it looks like I will soon hit a limit where the computation time is prohibitive. Therefore, I am not sure I can really use GP regression for my problem anymore... In this respect, I am (naively) thinking that the fact that distant points are essentially uncorrelated should help the algorithm somewhat. More specifically, one could use a kernel that is exactly zero for distant points: in this case, I presume, one could take advantage of the fact that many of the kernel matrices in the algorithm would be sparse. Is this exploited in GPy? Do you have any other suggestion to make the problem tractable when I have to deal with a few million points?

mzwiessele commented 6 years ago

There is the stochastic variant SVIGP, which allows for stochastic gradient descent and batch updates (a rough sketch of the batch-update mechanics is given below). I am not certain how to do heteroscedastic noise on that one, though. One way could be to put the noise variance on the inputs, or somehow use a heteroscedastic likelihood. I think that will require some implementation changes, though. I have never tried it before; maybe it just works, too.

If you find a way, we'd appreciate a pull request so that GPy can support this in the future.
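
In case it is useful, this is roughly what the SVGP / mini-batch pattern looks like in GPy with a plain homoscedastic Gaussian likelihood, using the external climin package for the stochastic optimizer. The optimizer settings are arbitrary, and whether the HeteroscedasticGaussian likelihood can simply be dropped in here is exactly the open question above, so treat this only as a starting point:

import numpy as np
import GPy
import climin  # separate package providing stochastic optimizers such as Adadelta

# X, Y, Z as before; batchsize sets the size of each stochastic mini-batch
m = GPy.core.SVGP(X, Y, Z, GPy.kern.RBF(1), GPy.likelihoods.Gaussian(),
                  batchsize=100)

# Adadelta steps on noisy gradients computed from one mini-batch at a time
opt = climin.Adadelta(m.optimizer_array, m.stochastic_grad,
                      step_rate=0.2, momentum=0.9)

def stop_after(info):
    return info['n_iter'] > 5000  # stop after 5000 stochastic updates

opt.minimize_until(stop_after)
print(m)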
