SheffieldML / GPy

Gaussian processes framework in python
BSD 3-Clause "New" or "Revised" License

Laplace inference not working with binomial likelihood #352

Closed djsutherland closed 7 years ago

djsutherland commented 8 years ago

Laplace inference appears to be broken with binomial likelihoods. Here is a notebook demonstrating the problem.

Specifically, with 1000 inputs, it's trying to allocate a 1000000 x 1000000 matrix, which of course fails since my machine doesn't have 7 terabytes of free memory. But that matrix should only be 1000 x 1000. Looking into it a little bit, it seems that the W matrix made here, which should be 1000 x 1, is actually 1000 x 1000, with each of the rows apparently identical.

This doesn't happen with a Poisson likelihood, so the error is probably in BinomialLikelihood.d2logpdf_df2. But I haven't tried to dig into those functions further.
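The blow-up can be reproduced in plain numpy: when an (N, 1) column is combined elementwise with an (N,) array, broadcasting silently produces an N x N result. The snippet below is a stand-in for the likelihood derivative code, not GPy's actual implementation:

```python
import numpy as np

N = 1000
f = np.random.randn(N, 1)             # latent values as a column vector, (N, 1)
trials = np.random.randint(1, 10, N)  # metadata passed with shape (N,) -- the mistake

# An elementwise second-derivative computation (stand-in for d2logpdf_df2):
# (N, 1) combined with (N,) broadcasts to (N, N), so W comes out N x N
# instead of N x 1, and downstream code tries to build enormous matrices.
W_bad = trials * f**2
W_good = trials[:, None] * f**2       # reshape metadata to (N, 1) first

assert W_bad.shape == (N, N)
assert W_good.shape == (N, 1)
```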

mzwiessele commented 8 years ago

@alansaul is that right? Maybe this is the problem you were facing?

alansaul commented 8 years ago

Oops, this must have slipped through the testing net. I believe this particular memory explosion is actually caused by the shape of the Y_metadata; it must be the same shape as Y:

Y_metadata={'trials': ns[:, None]}

We should obviously check for this; I will add a check and a test case. However, it seems we have never used Laplace with the Binomial likelihood, since d3logpdf_dlink3 has never been implemented. If you would like to make a pull request that would be great; if not, I don't mind doing it when I get a minute next week.
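For reference, the missing third derivative has a simple closed form. With link value p, the binomial log-likelihood is log C(n, y) + y log p + (n - y) log(1 - p), so the third derivative w.r.t. p is 2y/p^3 - 2(n - y)/(1 - p)^3. The sketch below (plain Python, not GPy's code) checks that formula against a finite-difference third derivative:

```python
from math import lgamma, log

def loglik(p, y, n):
    # Binomial log-likelihood: log C(n, y) + y*log(p) + (n - y)*log(1 - p)
    return (lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
            + y * log(p) + (n - y) * log(1 - p))

def d3logpdf_dlink3(p, y, n):
    # Closed-form third derivative of loglik w.r.t. the link value p.
    return 2 * y / p**3 - 2 * (n - y) / (1 - p)**3

# Sanity-check against a 4-point finite-difference third derivative.
p, y, n = 0.3, 4.0, 10.0
h = 1e-3
numeric = (loglik(p + 2*h, y, n) - 2*loglik(p + h, y, n)
           + 2*loglik(p - h, y, n) - loglik(p - 2*h, y, n)) / (2 * h**3)
exact = d3logpdf_dlink3(p, y, n)
assert abs(numeric - exact) / abs(exact) < 1e-2
```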

Sorry for the inconvenience! Alan

alansaul commented 8 years ago

Scratch that, I found a minute. @dougalsutherland I haven't tested that this gives sensible results; can you take a look with an example you expect to work and see if it does anything reasonable? Sampling may not work yet, so to generate predictions for Y you may need to change "samples" in likelihoods/binomials.py.

djsutherland commented 8 years ago

Thanks! Personally, I find requiring 2d Y extremely counter-intuitive, and didn't even consider that.

I've left comments about #356 over there.

mzwiessele commented 8 years ago

Well, that is why we make you do it! It is mathematically the right thing to pass a column vector into a mathematical framework. You should get used to these things: as soon as you start working with mathematical code that is explicit about dimensions, it will decrease your rate of errors significantly!

djsutherland commented 8 years ago

@mzwiessele, I just wanted to let you know that as an almost-graduated machine learning PhD student who does most of my numerical work in Python, your comment comes across as quite condescending. I'm not going to argue the point since GPy has made its decision and it's a waste of time, but the claim that it's more mathematically pure is questionable, whether it leads to reduced bugs is extremely questionable, and it is exceedingly unpythonic. In any case, since I would expect most regular Python users to make the same "mistake" (especially since such things are essentially undocumented), having checks on the size everywhere it matters (as @alansaul introduced here in #356) is I think important.

mzwiessele commented 8 years ago

No offence meant at all. I have seen many people starting out not using dimensions correctly, which introduces unforeseeable results, especially with broadcasting. Therefore, we try to make the user aware of these issues. But I get your point: without the checks that is not good enough. We are working on putting all the checks in. Good spotting, and thanks for the input; we are fixing it for the next patch.


lawrennd commented 8 years ago

Hi Dougal,

Yes, it's a design decision, which perhaps we should be clearer about.

All design matrices in statistics have data along the rows. Unfortunately numpy has some fairly strange implementation issues with regard to matrices and arrays. Matrices are more natural in many respects for containing data, but they are basically a non-starter: inconsistent handling of matrices and arrays can lead to horrible bugs, so matrices are never used.

Normal 1-d arrays are very confusing because they behave like row vectors (i.e. indexed along columns). But traditionally in maths the first dimension is always rows.

This issue may emerge from choosing row major order in numpy, who knows, but it's very odd and confusing when you are treating arrays as matrices (which is what we do all the time in GPy).

This is one area where MATLAB and Julia are a long way ahead of python. We are not wedded to python, and don't really wish to be too constrained by what is perceived as pythonic. After all, python wasn't originally designed (and neither was numpy) with Gaussian process models in mind... I think we also aren't the only researchers to identify these issues.

We made an early decision to explicitly use 2-d arrays for data, and while any solution is a compromise, I believe that we have the right one for our purposes.
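The point about 1-d arrays can be seen directly in numpy; this is a sketch of the array semantics, nothing GPy-specific:

```python
import numpy as np

a = np.arange(3.0)        # "natural" 1-d array, shape (3,)
col = a[:, None]          # explicit column vector, shape (3, 1)

# A 1-d array is neither a row nor a column: transposing it is a no-op.
assert a.T.shape == (3,)
assert col.T.shape == (1, 3)

# Under broadcasting, a 1-d array aligns with the *last* axis, i.e. it
# behaves like a row vector, so mixing it with a column silently gives
# an outer-product-shaped result rather than an elementwise one.
assert (col - a).shape == (3, 3)
assert (col - col).shape == (3, 1)
```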

You shouldn't take too much offence at Max trying to explain that quickly, he's trying to close a number of issues and it's a lot of work. Best thing to do if you see some way of fixing things or making it clearer for future users is to submit a pull request.

Thanks for getting involved!

Neil

djsutherland commented 8 years ago

Hi Neil,

Thanks for the reply. I maybe came across as more offended than I actually was – sorry about that. It's a stupid argument anyway.

Deciding to explicitly use column vectors everywhere is of course a fine decision, it's just very distinct from what other scientific/ML Python software uses (scikit-learn, theano, ...), and I at least didn't really see it mentioned anywhere in the documentation. It's probably best to add that somewhere in the introductory materials, which right now are kind of scattered, and especially to make sure that there are checks everywhere to error fast on that instead of letting it propagate through code that assumes otherwise and end up trying to allocate 7-terabyte matrices. :) If I run into other places where this seems like a problem I'll certainly open issues or send pull requests as needed.
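A fail-fast check of the kind discussed here might look like the following; `check_column` is a hypothetical helper for illustration, not part of GPy's API:

```python
import numpy as np

def check_column(name, arr, n_rows):
    """Fail fast if arr is not an (n_rows, 1) column vector.

    Hypothetical helper for illustration; not part of GPy's API.
    """
    arr = np.asarray(arr)
    if arr.shape != (n_rows, 1):
        raise ValueError(
            "%s must have shape (%d, 1), got %s" % (name, n_rows, arr.shape))
    return arr

ns = np.arange(5)
try:
    check_column("Y_metadata['trials']", ns, 5)   # 1-d input: rejected early
    raised = False
except ValueError:
    raised = True
assert raised
assert check_column("trials", ns[:, None], 5).shape == (5, 1)  # column passes
```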

Thanks for making this library so that I don't have to drop into matlab just for GP fitting anymore. :) – Dougal

lawrennd commented 8 years ago

No probs Dougal, thanks again for getting involved! We'll try and add a section about this somewhere at some point.


Guitlle commented 8 years ago

Hi. After fixing the issue with the vector format, I am getting this error:

```
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input> in <module>()
     10     likelihood=GPy.likelihoods.Binomial(),
     11     Y_metadata={'trials': ns[:, None]},
---> 12     inference_method=GPy.inference.latent_function_inference.Laplace(),
     13 )

C:\Anaconda3\lib\site-packages\paramz\parameterized.py in __call__(self, *args, **kw)
     52         self._model_initialized_ = initialize
     53         if initialize:
---> 54             self.initialize_parameter()
     55         else:
     56             import warnings

C:\Anaconda3\lib\site-packages\paramz\core\parameter_core.py in initialize_parameter(self)
    328         self._highest_parent_._connect_fixes()
    329         self._highest_parent_._connect_parameters() #logger.debug("calling parameters changed")
---> 330         self.parameters_changed()
    331
    332     @property

C:\Anaconda3\lib\site-packages\GPy\core\gp.py in parameters_changed(self)
    188         this method yourself, there may be unexpected consequences.
    189         """
---> 190         self.posterior, self._log_marginal_likelihood, self.grad_dict = self.inference_method.inference(self.kern, self.X, self.likelihood, self.Y_normalized, self.mean_function, self.Y_metadata)
    191         self.likelihood.update_gradients(self.grad_dict['dL_dthetaL'])
    192         self.kern.update_gradients_full(self.grad_dict['dL_dK'], self.X)

C:\Anaconda3\lib\site-packages\GPy\inference\latent_function_inference\laplace.py in inference(self, kern, X, likelihood, Y, mean_function, Y_metadata)
    141
    142         #Compute hessian and other variables at mode
---> 143         log_marginal, woodbury_inv, dL_dK, dL_dthetaL = self.mode_computations(f_hat, Ki_fhat, K, Y, likelihood, kern, Y_metadata)
    144
    145         self._previous_Ki_fhat = Ki_fhat.copy()

C:\Anaconda3\lib\site-packages\GPy\inference\latent_function_inference\laplace.py in mode_computations(self, f_hat, Ki_f, K, Y, likelihood, kern, Y_metadata)
    251
    252         # Compute matrices for derivatives
---> 253         dW_df = -likelihood.d3logpdf_df3(f_hat, Y, Y_metadata=Y_metadata) # -d3lik_d3fhat
    254         if np.any(np.isnan(dW_df)):
    255             raise ValueError('One or more element(s) of dW_df is NaN')

C:\Anaconda3\lib\site-packages\GPy\util\misc.py in wrapper_func(self, *args, **kwargs)
    151     def wrapper_func(self, *args, **kwargs):
    152         # Invoke the wrapped function first
---> 153         retval = func(self, *args, **kwargs)
    154         # Now do something here with retval and/or action
    155         if self.not_block_really and (len(retval.shape) < 3):

C:\Anaconda3\lib\site-packages\GPy\likelihoods\likelihood.py in d3logpdf_df3(self, f, y, Y_metadata)
    511         else:
    512             inv_link_f = self.gp_link.transf(f)
---> 513             d3logpdf_dlink3 = self.d3logpdf_dlink3(inv_link_f, y, Y_metadata=Y_metadata)
    514             dlink_df = self.gp_link.dtransf_df(f)
    515             d2logpdf_dlink2 = self.d2logpdf_dlink2(inv_link_f, y, Y_metadata=Y_metadata)

C:\Anaconda3\lib\site-packages\GPy\likelihoods\likelihood.py in d3logpdf_dlink3(self, inv_link_f, y, Y_metadata)
    378
    379     def d3logpdf_dlink3(self, inv_link_f, y, Y_metadata=None):
---> 380         raise NotImplementedError
    381
    382     def dlogpdf_link_dtheta(self, inv_link_f, y, Y_metadata=None):

NotImplementedError:
```

djsutherland commented 8 years ago

@Guitlle That's because #356 hasn't been merged yet. You can check out the binomial_laplace branch and it should work, though predict won't work yet, as I mentioned over there.

Guitlle commented 8 years ago

Thanks. I noticed it yesterday. I opened another issue about it but I'm closing it and commenting on the pull request thread.

djsutherland commented 7 years ago

Solved by #356.