cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

Encountered nan in confidence bounds #472

Closed. Akella17 closed this issue 5 years ago.

Akella17 commented 5 years ago

Simple 1D GP regression using KISS-GP: when I tried to execute some of the example code provided in this repository, I encountered NaNs in the confidence bounds (lower, upper = prediction.confidence_region()). This trend is more prevalent when I overtrain (for more epochs than specified) or increase the size of the training data.

I wanted to know why this might be occurring, and whether it has any implications for the overall performance of GP regression (i.e. the kernel means and covariances have remained finite, and prolonged training seems to fit them better to the training data despite the NaNs in the confidence bounds).
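For reference, a minimal sketch of the kind of setup in question, loosely following the KISS-GP 1D regression tutorial (the data generation, grid size, and elided training loop here are illustrative assumptions, not the exact notebook shared later in this thread):

```python
import math
import torch
import gpytorch

# Toy training data roughly in the style of the KISS-GP 1D regression tutorial
train_x = torch.linspace(0, 1, 1000)
train_y = torch.sin(train_x * (4 * math.pi)) + torch.randn(train_x.size()) * 0.2

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # SKI / KISS-GP: a base RBF kernel interpolated onto a dense grid
        # (grid_size=128 is an arbitrary illustrative choice)
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.GridInterpolationKernel(
                gpytorch.kernels.RBFKernel(), grid_size=128, num_dims=1
            )
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

# ... fit hyperparameters with gpytorch.mlls.ExactMarginalLogLikelihood ...

model.eval()
likelihood.eval()
test_x = torch.linspace(0, 1, 51)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    prediction = likelihood(model(test_x))
    lower, upper = prediction.confidence_region()  # NaNs appeared here
```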

gpleiss commented 5 years ago

Were you following this example: https://gpytorch.readthedocs.io/en/latest/examples/04_Scalable_GP_Regression_1D/KISSGP_Regression_1D.html#?

Can you provide an IPython notebook, or can you print out the hyperparameters learned by the model? It's possible that you're encountering a numerical instability with certain hyperparameter values.

Also, were you using the gpytorch.settings.fast_pred_var() context manager?

Akella17 commented 5 years ago

I used the GPU-accelerated version of this code: https://gpytorch.readthedocs.io/en/latest/examples/04_Scalable_GP_Regression_1D/KISSGP_Regression_1D_CUDA.html.

The only modification I made was increasing the training set size from 1,000 to 10,000 datapoints. (The error disappears in some reruns, so I suspect some sort of sensitivity to the initial parameters.)

I am sharing the drive location to the ipython notebook: https://drive.google.com/file/d/1W4Y9pthOgEb_E59vCJj-hJzRRA46U9uD/view?usp=sharing

gpleiss commented 5 years ago

Can you remove the with gpytorch.settings.fast_pred_var(): context manager and let me know if you see the same error?
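For concreteness, a short sketch of the two prediction blocks being compared, reusing model, likelihood, and test_x from the sketch above:

```python
import torch
import gpytorch

# With fast predictive variances -- the configuration that produced the NaNs:
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    prediction = likelihood(model(test_x))
    lower, upper = prediction.confidence_region()

# Without the context manager -- standard (slower) predictive variances:
with torch.no_grad():
    prediction = likelihood(model(test_x))
    lower, upper = prediction.confidence_region()
```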

Akella17 commented 5 years ago

Yes, that seems to fix the issue. Thanks for the quick response. If you don't mind, I would like to know the possible reasons why gpytorch.settings.fast_pred_var() could have caused this. I plan on using a more complicated architecture involving stochastic variational inference with SKI for large-scale GP regression, and I expect fast predictive distributions to significantly speed up evaluation.

jacobrgardner commented 5 years ago

@Akella17 -- It definitely should work with fast predictive variances, but that test lets us confirm that the fast variances are indeed the issue (we are both having trouble reproducing it on our end).

A couple more things:

Akella17 commented 5 years ago

This issue occurred on master. It is a relief to hear that SVI with SKI gives O(1) prediction time out of the box. I plan on using this to design a reinforcement learning algorithm for continuous environments. Since this problem demands large-scale data (10^6+ datapoints) and stochastic optimization, I was more inclined to use SVI over ExactGP. Also, I was wondering whether the library offers more sophisticated kernels (like the Spectral Mixture Kernel) for sample-efficient but accurate interpolation and prediction.

Also, I would like to know if there is any difference (in terms of performance) between:

1) AdditiveGridInducingVariationalGP module: https://gpytorch.readthedocs.io/en/latest/examples/08_Deep_Kernel_Learning/Deep_Kernel_Learning_DenseNet_CIFAR_Tutorial.html

2) AbstractVariationalGP module with AdditiveGridInterpolationVariationalStrategy: https://gpytorch.readthedocs.io/en/latest/examples/07_Scalable_GP_Classification_Multidimensional/KISSGP_Additive_Classification_CUDA.html?highlight=additive

3) ExactGP module with AdditiveStructureKernel wrapped around GridInterpolationKernel: https://gpytorch.readthedocs.io/en/latest/examples/05_Scalable_GP_Regression_Multidimensional/KISSGP_Additive_Regression_CUDA.html?highlight=additive

gpleiss commented 5 years ago

@Akella17 - we do support spectral mixture kernels. There is an example notebook here: https://gpytorch.readthedocs.io/en/latest/examples/01_Simple_GP_Regression/Spectral_Mixture_GP_Regression.html
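For reference, a minimal sketch of dropping a spectral mixture kernel into an ExactGP model (the num_mixtures value is an arbitrary choice; see the linked notebook for the full training loop):

```python
import torch
import gpytorch

class SpectralMixtureGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Spectral mixture kernel with 4 mixture components (arbitrary choice)
        self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4)
        # Initialize the mixture parameters from the empirical spectrum of the data
        self.covar_module.initialize_from_data(train_x, train_y)

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )
```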

As far as the differences between models: all three models that you list assume that the underlying function can be expressed as an additive combination of functions across all dimensions.

1) This example uses a deep network for feature extraction. The deep net extracts features that decompose additively across dimensions, making it possible to use additive-structure models on data that might not decompose additively in the native feature space. This should, in some sense, be the most powerful model (since it includes a deep neural network), but it also involves more parameters to optimize.

2) This is the same as 1, but without a deep network for feature extraction.

3) This model is exact regression. It should be the most accurate of all the models, but because it uses exact inference it won't scale quite as well as the variational methods. You can probably use this model on datasets of up to 100,000-500,000 data points before you start running into memory issues.
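For reference, a minimal sketch of option 3, i.e. an ExactGP whose covariance is an AdditiveStructureKernel wrapped around a GridInterpolationKernel (the RBF base kernel and grid_size here are illustrative assumptions):

```python
import torch
import gpytorch

class AdditiveKISSGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        num_dims = train_x.size(-1)
        self.mean_module = gpytorch.means.ConstantMean()
        # SKI on a 1D grid per dimension, summed additively across dimensions
        self.covar_module = gpytorch.kernels.AdditiveStructureKernel(
            gpytorch.kernels.ScaleKernel(
                gpytorch.kernels.GridInterpolationKernel(
                    gpytorch.kernels.RBFKernel(), grid_size=100, num_dims=1
                )
            ),
            num_dims=num_dims,
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )
```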

As soon as we update our variational inference (in some upcoming PRs), you'll also be able to use SVGP regression. This is an SVI method that makes no assumption about additive decompositions.

jacobrgardner commented 5 years ago

@Akella17 I'm going to close this in favor of tracking the actual bug on #475 if that's alright -- they are the same underlying bug.

For the larger scale problems you're talking about I'd recommend checking out the SVI+DKL regression examples here and here.

When using variational inference, additive structure (which you've already discovered) and deep kernels are the main ways of getting around the dimensionality scaling of SKI. The latter of these notebooks (which uses SVGP with learned inducing point locations) can be run with or without the deep kernel if you'd like, but the deep kernel really does help a lot on large scale data. Either way, these notebooks are the most similar to your setting in that they contain minibatched regression code, which should comfortably let you scale to arbitrarily large scale data.
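For reference, a minimal minibatched SVGP regression sketch with learned inducing point locations (class names follow the API referenced in this thread and may differ across gpytorch versions; the deep kernel feature extractor is omitted for brevity, and the data, batch size, and inducing point count are arbitrary):

```python
import math
import torch
import gpytorch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for a large training set
train_x = torch.rand(10000, 2)
train_y = torch.sin(train_x.sum(-1) * (2 * math.pi))

class SVGPRegressionModel(gpytorch.models.AbstractVariationalGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        # learn_inducing_locations=True makes the inducing point locations trainable
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

inducing_points = train_x[:500]  # arbitrary number of inducing points
model = SVGPRegressionModel(inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)

# Minibatched training loop
loader = DataLoader(TensorDataset(train_x, train_y), batch_size=1024, shuffle=True)
model.train()
likelihood.train()
for epoch in range(10):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = -mll(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
```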

As Geoff mentioned above, the latter of these two notebooks will be receiving significant improvements once #474 is in.

Akella17 commented 5 years ago

I believe we can implement a deep network for feature extraction with the AbstractVariationalGP module and AdditiveGridInterpolationVariationalStrategy as well (by changing GridInterpolationVariationalStrategy (GIVS) to AdditiveGridInterpolationVariationalStrategy (AGIVS) in https://gpytorch.readthedocs.io/en/latest/examples/05_Scalable_GP_Regression_Multidimensional/SVDKL_Regression_GridInterp_CUDA.html). So is there no difference (in terms of performance) between 1) the AdditiveGridInducingVariationalGP module and 2) the AbstractVariationalGP module with AdditiveGridInterpolationVariationalStrategy?

Also, what is the difference between SVGP and the grid interpolation variational strategy for SV-DKL (which is more effective: learning inducing point locations or grid interpolation)? Moreover, I need some clarification about the example at https://gpytorch.readthedocs.io/en/latest/examples/05_Scalable_GP_Regression_Multidimensional/SVDKL_Regression_SVGP_CUDA.html (it provides inducing points as input to the GP layer, which in turn learns the inducing point locations).

jacobrgardner commented 5 years ago

@Akella17

  1. Yes, there's no difference between the two modules you mentioned. In fact, the model you mention in 1 is actually the deprecated interface for achieving what you mention in 2 -- if you instantiate one, you will get a warning saying it will be removed at a later date. Use the setup you describe in 2.

  2. With variational inference, there isn't as much of a difference between the grid interpolation method and the learned inducing point location method as there is on the exact GP side because both methods require O(m^2) time and space to store and manipulate the variational covariance matrix. In the 2 dimensional output case of SV-DKL, both methods work just fine and have comparable speeds. The new WhitenedVariationalStrategy introduced in #474 may be faster than either GridInterpolationVariationalStrategy or VariationalStrategy.

  3. What clarification do you need for this example? The inducing points that are input to the GP layer are the initial locations of the learned inducing points. In that example notebook, we initialize by passing a small subset of the training data through the initialized neural network and using the extracted features as inducing point initialization.
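For reference, a minimal sketch of that initialization pattern (the feature extractor architecture, input dimensionality, and subset size are hypothetical; SVGPRegressionModel refers to the model class sketched earlier in this thread):

```python
import torch

# Hypothetical feature extractor (any nn.Module mapping raw inputs to the
# feature space the GP layer operates on would do)
feature_extractor = torch.nn.Sequential(
    torch.nn.Linear(10, 50), torch.nn.ReLU(), torch.nn.Linear(50, 2)
)

train_x = torch.rand(10000, 10)  # stand-in for the real training inputs

# Pass a small random subset of the training data through the (still untrained)
# network and use the extracted features as the inducing point initialization
with torch.no_grad():
    subset = train_x[torch.randperm(train_x.size(0))[:500]]
    inducing_points = feature_extractor(subset)

model = SVGPRegressionModel(inducing_points)  # the model class sketched earlier
```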

Akella17 commented 5 years ago

Thanks again for the quick response. I want to know how the inducing point initialization affects the learning of SVGP, i.e., could a bad initialization drastically affect learning? (Since I plan on using the GP layer along with a feature extractor network in reinforcement learning settings, the initial inducing points will almost always be drastically different from the learned feature extractor representations.) In other words, I want to know whether using SVGP with learned inducing points in RL settings makes the system more sensitive to the initial parameterization than the grid interpolation method (variational inference).

jacobrgardner commented 5 years ago

In our experience, setting the initial inducing points reasonably mostly affects the speed of convergence (due to how far you start from a good setting) rather than the final model you end up with. I wouldn't expect large variability in your final model stemming from the inducing point initialization.

Akella17 commented 5 years ago

Thanks for the clarification.