Yelp / MOE

A global, black box optimization engine for real world metric optimization.

[Python] figure out whether we really need the SVD for singular or near-singular variance matrices in EI #325

Open suntzu86 opened 10 years ago

suntzu86 commented 10 years ago

RELATED TO: #314

In #257, I catch Cholesky failures and use a combination of the SVD and QR factorizations to handle numerically SPSD matrices smoothly. This is probably overkill, but performance wasn't a huge concern and we needed a quick fix.

A few points of further investigation come to mind:

  1. U = V in the SVD as long as the matrix is SPSD, but we can probably still end up with numerically indefinite matrices, so it's worth double-checking that U = V and throwing an error if not (a rough sketch combining this check with the jitter idea from item 3 appears after this list).
  2. eigenvalue decomposition and the LDL factorization are also options (but may not be as well-behaved as the SVD)
  3. we could even do things as simple/silly as adding a small multiple (like 1.0e-16) of the identity matrix to the variance matrix before factorization.
  4. (LONG TERM) what's nice about the SVD here is that it effectively removes the points that are duplicates/causing problems. So the EI of 8 points with 3 duplicates will come out to be the same as the EI of the 5 unique points. There are other, cheaper low rank approximation methods (several papers published on this topic) for dealing with this very situation even in the context of GPs
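
For concreteness, here is a rough sketch of the layered fallback described in items 1 and 3 (illustrative only, not the actual #257 code; the function name, jitter scale, and tolerances are made up): try Cholesky, retry with a small diagonal jitter, and only then fall back to the SVD with a U = V sanity check.

```python
import numpy as np
import scipy.linalg


def factor_variance(K, jitter=1.0e-16, tol=1.0e-8):
    """Illustrative fallback chain for a numerically SPSD variance matrix.

    Try Cholesky first, then Cholesky with a small diagonal jitter (item 3),
    and finally an SVD with a U = V sanity check (item 1). Returns a tag plus
    the factor(s) so the caller knows which path succeeded.
    """
    try:
        return 'cholesky', scipy.linalg.cholesky(K, lower=True)
    except scipy.linalg.LinAlgError:
        pass

    try:
        jittered = K + jitter * np.trace(K) * np.eye(K.shape[0])
        return 'cholesky', scipy.linalg.cholesky(jittered, lower=True)
    except scipy.linalg.LinAlgError:
        pass

    U, s, Vt = scipy.linalg.svd(K)
    # For a (numerically) SPSD matrix the left/right singular vectors agree on
    # the columns with non-negligible singular values; a sign flip there means
    # a negative eigenvalue, i.e. the matrix is numerically indefinite.
    significant = s > tol * s[0]
    if not np.allclose(U[:, significant], Vt.T[:, significant], atol=tol):
        raise ValueError('variance matrix appears numerically indefinite (U != V)')
    return 'svd', (U, s)
```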

These hard cases also have the potential to break our analytic qd EI implementation (causing the NaNs that are being eaten in #314)

mikemccourt commented 10 years ago

Scott and I had a discussion about a similar situation (in a different context) some time ago, but I think I can give you some ideas here, since the internet is the better place to talk about this stuff. Below you can find a bunch of stuff about solving matrix problems associated with Gaussian processes - I've typed it up rather haphazardly, so feel free to ask questions if things don't make sense, and if there's a term or reference you're not familiar with, I'll explain it. I hope some of this helps; if not, I have other stuff as well. Also, I never post on online sites, so if I'm doing this wrong, lemme know.

The covariance matrix associated with a MVN is symmetric positive definite (under the common assumption that the covariance kernel for the underlying Gaussian process is symmetric and positive definite, though I have seen people use conditionally positive definite kernels with filtering), so you have more options available than for just generic matrices. Even so, you don't necessarily need to work with a symmetric decomposition of your matrix. My own research on stable Gaussians and other PD kernels is an example of this (http://epubs.siam.org/doi/abs/10.1137/110824784, or the more recent http://link.springer.com/article/10.1007/s11075-014-9850-z) which will work well in some circumstances. The reason I mention those first is because I can help you the most with them :), and also because it's probably the best strategy for working with Gaussians that are at their "flat limit" and causing you ill-conditioning.

If your ill-conditioning is caused not by the flat limit but rather because points are being sampled too close together or duplicate points, it's common to instead solve a Hermite interpolation problem. I'm not sure that's something you want to do, but there's a solid discussion of it in Fasshauer's "Meshfree Approximation Methods in Matlab" book.

Another example of a nonsymmetric decomposition of the matrix is a pivoted QR factorization, which is rank revealing and will allow you to avoid degeneracy on nonunisolvent point sets, which can occur at machine precision for scattered data even with positive definite kernels (like how not all point sets have a unique polynomial interpolant). This is discussed in the context of meshfree finite difference stencils http://epubs.siam.org/doi/abs/10.1137/120899108 and probably somewhere else as well, although I can't think of a citation immediately.
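
To make the pivoted-QR idea concrete, here is a minimal sketch (my own illustration rather than anything in MOE; the helper name and tolerance are arbitrary) of using scipy's column-pivoted QR to estimate the numerical rank of a covariance matrix and flag the near-duplicate points responsible for the degeneracy:

```python
import numpy as np
import scipy.linalg


def numerical_rank_via_pivoted_qr(K, rel_tol=1.0e-12):
    """Use a column-pivoted (rank-revealing) QR to find the numerical rank of K.

    The pivot order lists columns (i.e. sampled points, for a covariance
    matrix) from most to least independent, so the trailing pivots flag the
    points that add essentially no new information.
    """
    Q, R, piv = scipy.linalg.qr(K, pivoting=True)
    diag = np.abs(np.diag(R))  # non-increasing by construction of the pivoting
    rank = int(np.sum(diag > rel_tol * diag[0]))
    redundant_points = piv[rank:]
    return rank, redundant_points
```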

But, as you suggest in your discussion above, it is often preferable to work with a symmetric decomposition of your matrix - the SVD and eigenvalue decompositions are equivalent for an SPD matrix so I'll just say SVD. You have mentioned that you can check if U=V, but if you definitely want to require that then I would look for an SVD implementation for SPD matrices, or just an eigenvalue factorization scheme for ill-conditioned matrices. In my mind, if you have ill-conditioning, LDL is probably not the way to go.
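
If you do go the symmetric route, something like the following sketch (again illustrative, not MOE code) sidesteps the U = V question entirely: eigendecompose the symmetrized matrix and drop the tiny or slightly negative eigenvalues that numerical noise introduces.

```python
import numpy as np


def spectral_pseudo_factor(K, rel_tol=1.0e-12):
    """Symmetric alternative to the generic SVD: eigendecompose K = Q diag(w) Q^T.

    Because eigh works on the symmetric part directly there is no separate
    U = V check; eigenvalues below a relative cutoff (numerical noise in an
    SPSD matrix) are simply dropped, giving a low-rank square-root factor L
    with K ~= L @ L.T.
    """
    w, Q = np.linalg.eigh((K + K.T) / 2.0)  # symmetrize to guard against round-off
    keep = w > rel_tol * w.max()
    return Q[:, keep] * np.sqrt(w[keep])  # shape (n, numerical_rank)
```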

The idea of adding a little fudge factor to the diagonal is a common and useful one, originally due to Tikhonov (hence Tikhonov regularization) and subsequently called ridge regression. In doing so, you're solving an optimization problem on the interpolant coefficients c rather than an interpolation problem: min_c ||Kc - y|| + lambda ||c|| for your fudge factor lambda. Depending on which norms you pick, you'll get a different regularization solution, but the idea behind using positive definite kernels is to take advantage of the reproducing property to minimize the associated Hilbert space norm. Theoretically, you're supposed to solve the normal equations (K^T K + lambda I)c = K^T y, but for SPD K it's common to solve just (K + lambda I)c = y.

Another strategy, which is kind of old but has been popularized by Gene Golub, is Riley's algorithm of adding a constant to the diagonal and filtering it out through a von Neumann series. The original reference is http://www.jstor.org/discover/10.2307/2002065?uid=3739256&uid=2&uid=4&sid=21104524873133, but I typed it up in more modern notation (matrices from the 50s are kinda hard to read) on page 5 of http://math.iit.edu/~mccomic/gaussqr/bayesrbf.pdf. You can ignore the rest of that paper, as there is a more advanced version of it coming out soon.
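
As a rough sketch of Riley's algorithm in the notation above (purely illustrative; the shift lambda and iteration count are placeholders), you factor the shifted matrix K + lambda I once and then filter the shift back out with the von Neumann iteration c_{k+1} = c_0 + lambda (K + lambda I)^{-1} c_k:

```python
import numpy as np
import scipy.linalg


def riley_solve(K, y, lam=1.0e-8, num_iter=20):
    """Riley's algorithm: regularize K with a diagonal shift, then iterate the
    shift back out via a von Neumann series.

    Factor C = K + lam*I once, set c_0 = C^{-1} y, and iterate
        c_{k+1} = c_0 + lam * C^{-1} c_k,
    which converges to K^{-1} y for SPD K because the eigenvalues of
    lam * C^{-1} lie strictly between 0 and 1.
    """
    n = K.shape[0]
    cho = scipy.linalg.cho_factor(K + lam * np.eye(n))
    c0 = scipy.linalg.cho_solve(cho, y)
    c = c0.copy()
    for _ in range(num_iter):
        c = c0 + lam * scipy.linalg.cho_solve(cho, c)
    return c
```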

As a side note, there has been a good deal of recent work by former Argonne colleagues of mine on decreasing the cost of predicting with Gaussian processes through the use of hierarchically semi-separable decompositions. If you're working with systems bigger than 1000x1000 or so, you might start to see advantages to using this, but the cost of implementing it is ridiculous (even by my academic standards), so don't think that it'll solve problems easily. References: http://www.mcs.anl.gov/papers/P5078-0214.pdf, http://www.mcs.anl.gov/publication/data-structure-and-algorithms-recursively-low-rank-compressed-matrices, http://epubs.siam.org/doi/abs/10.1137/110831143