Closed djsutherland closed 7 years ago
@alansaul is that right? Maybe that is the source of the problem you were facing?
Oops, this must have slipped through the testing net. I believe this particular memory-explosion issue is actually with the shape of Y_metadata: it must be the same shape as Y:
Y_metadata={'trials': ns[:, None]}
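For context, here is a minimal sketch of the shape convention in plain NumPy (illustrating the shapes rather than GPy's full model-construction API): both `Y` and each array in `Y_metadata` should be 2-d column vectors of shape (N, 1), which is what the `[:, None]` indexing produces from a 1-d array.

```python
import numpy as np

N = 1000
ns = np.random.randint(1, 20, size=N)   # trial counts, shape (N,)
ks = np.random.binomial(ns, 0.3)        # observed successes, shape (N,)

# GPy expects 2-d column vectors: [:, None] reshapes (N,) -> (N, 1)
Y = ks[:, None].astype(float)
Y_metadata = {'trials': ns[:, None]}    # must match Y's shape

assert Y.shape == (N, 1)
assert Y_metadata['trials'].shape == Y.shape
```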
We should obviously check for this; I will add a check and a test case. However, it seems we have never actually used Laplace with the Binomial likelihood, since d3logpdf_dlink3 has never been implemented. If you would like to make a pull request, that would be great; if not, I don't mind doing it when I get a minute next week.
Sorry for the inconvenience! Alan
Scratch that, I found a minute. @dougalsutherland I haven't tested that this gives sensible results; can you take a look with an example you expect to work and see whether it does anything reasonable? Sampling may not work yet, so to generate predictions for Y you may need to change "samples" in likelihoods/binomials.py.
Thanks! Personally, I find requiring 2d Y
extremely counter-intuitive, and didn't even consider that.
Comments about #356 over there.
Well, that is why we make you do it! Passing a column vector into a mathematical framework is the mathematically correct thing to do. You should get used to it: as soon as you start working with mathematical code that is explicit about dimensions, your error rate will drop significantly!
@mzwiessele, I just wanted to let you know that as an almost-graduated machine learning PhD student who does most of my numerical work in Python, your comment comes across as quite condescending. I'm not going to argue the point since GPy has made its decision and it's a waste of time, but the claim that it's more mathematically pure is questionable, whether it leads to reduced bugs is extremely questionable, and it is exceedingly unpythonic. In any case, since I would expect most regular Python users to make the same "mistake" (especially since such things are essentially undocumented), having checks on the size everywhere it matters (as @alansaul introduced here in #356) is I think important.
No offence at all. I have seen many newcomers get dimensions wrong, which introduces unforeseeable results, especially with broadcasting. Therefore, we try to make the user aware of these issues. But I take your point: without the checks, that is not good enough. We are working on putting all the checks in. Good spotting, and thanks for the input; we are fixing it for the next patch.
Hi Dougal,
Yes, it's a design decision, which perhaps we should be clearer about.
All design matrices in statistics have data along the rows. Unfortunately, numpy has some fairly strange implementation choices regarding matrices and arrays. Matrices are in many respects more natural for containing data, but they are basically a non-starter: inconsistent handling of matrices and arrays can lead to horrible bugs, so matrices are never used.
Ordinary 1-d arrays are very confusing because they behave like row vectors (i.e. they are indexed along columns), whereas traditionally in maths the first dimension is always rows.
This issue may stem from numpy's choice of row-major order, who knows, but it's very odd and confusing when you treat arrays as matrices (which is what we do all the time in GPy).
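A quick illustration of the hazard in plain NumPy: a 1-d array broadcasts like a row vector, so mixing it with a column vector silently produces a full N x N matrix instead of the elementwise result the user probably intended.

```python
import numpy as np

y_1d = np.arange(4, dtype=float)   # shape (4,): behaves like a row vector
y_col = y_1d[:, None]              # shape (4, 1): an explicit column vector

# Elementwise subtraction, as intended:
print((y_col - y_col).shape)       # (4, 1)

# Mixing 1-d and column arrays silently broadcasts to a full matrix:
print((y_1d - y_col).shape)        # (4, 4)
```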
This is one area where MATLAB and Julia are a long way ahead of Python. We are not wedded to Python, and don't really wish to be too constrained by what is perceived as pythonic. After all, Python wasn't originally designed (and neither was numpy) with Gaussian process models in mind ... I think we also aren't the only researchers to identify these issues.
We made an early decision to explicitly use 2-d arrays for data, and while any solution is a compromise, I believe that we have the right one for our purposes.
You shouldn't take too much offence at Max trying to explain that quickly; he's trying to close a number of issues and it's a lot of work. The best thing to do, if you see some way of fixing things or making them clearer for future users, is to submit a pull request.
Thanks for getting involved!
Neil
Hi Neil,
Thanks for the reply. I may have come across as more offended than I actually was – sorry about that. It's a stupid argument anyway.
Deciding to explicitly use column vectors everywhere is of course a fine decision; it's just very different from what other scientific/ML Python software does (scikit-learn, theano, ...), and I at least didn't see it mentioned anywhere in the documentation. It's probably best to add that somewhere in the introductory materials, which right now are somewhat scattered, and especially to make sure there are checks everywhere that error fast, instead of letting the mistake propagate through code that assumes otherwise and end up trying to allocate 7-terabyte matrices. :) If I run into other places where this seems like a problem, I'll certainly open issues or send pull requests as needed.
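The kind of fail-fast check meant here could look something like the following (a hypothetical helper for illustration, not GPy's actual code):

```python
import numpy as np

def check_column(name, arr, n_rows=None):
    """Raise immediately if arr is not a 2-d column vector
    (optionally also requiring exactly n_rows rows)."""
    arr = np.asanyarray(arr)
    if arr.ndim != 2 or arr.shape[1] != 1:
        raise ValueError(
            "%s must have shape (N, 1), got %s" % (name, arr.shape))
    if n_rows is not None and arr.shape[0] != n_rows:
        raise ValueError(
            "%s must have %d rows, got %d" % (name, n_rows, arr.shape[0]))
    return arr

Y = np.ones((1000, 1))
check_column("Y", Y)                       # passes silently
try:
    check_column("trials", np.ones(1000))  # 1-d input: caught immediately
except ValueError as e:
    print(e)
```

Catching the shape mismatch at model construction is much cheaper than discovering it later as a failed multi-terabyte allocation deep inside the inference code.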
Thanks for making this library so that I don't have to drop into matlab just for GP fitting anymore. :) – Dougal
No probs, Dougal, thanks again for getting involved! We'll try to add a section about this somewhere at some point.
Hi. After fixing the issue with the vector format, I am getting this error:
```
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
```
@DuGuille That's because #356 hasn't been merged yet. You can check out the `binomial_laplace` branch and it should work, though `predict` won't work yet, as I mentioned over there.
Thanks, I noticed that yesterday. I had opened another issue about it, but I'm closing it and commenting on the pull request thread instead.
Solved by #356.
Laplace inference appears to be broken with binomial likelihoods. Here is a notebook demonstrating the problem.
Specifically, with 1000 inputs, it tries to allocate a 1000000 x 1000000 matrix, which of course fails since my machine doesn't have 7 terabytes of free memory. But that matrix should only be 1000 x 1000. Looking into it a little, it seems that the `W` matrix made here, which should be 1000 x 1, is actually 1000 x 1000, with each of the rows apparently identical. This doesn't happen with a Poisson likelihood, so the error is probably in `BinomialLikelihood.d2logpdf_df2`. But I haven't tried to dig into those functions further.
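To see how the wrong shape cascades into the giant allocation, here is a NumPy sketch of the failure mode (not the actual GPy code): Laplace-style implementations typically build a diagonal matrix from `W`, and `np.diagflat` of a wrongly shaped (N, N) array produces an N² x N² result.

```python
import numpy as np

N = 50  # small so this actually runs; the issue above used N = 1000

W_correct = np.ones((N, 1))   # intended: one curvature value per data point
W_broken = np.ones((N, N))    # the bug: every row duplicated N times

# Laplace-style code often builds a diagonal matrix from W:
print(np.diagflat(W_correct).shape)   # (50, 50): fine
print(np.diagflat(W_broken).shape)    # (2500, 2500): with N = 1000 this is
                                      # 1000000 x 1000000, ~7 TB of float64
```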