idelbrid opened this issue 5 years ago
`AdditiveStructureKernel` and `ProductStructureKernel` intentionally use a factor of d more memory because they handle the structure in batch mode rather than in a for loop. This turns out to be extremely important for speed in two cases: (1) SV-DKL with additive structure in the last layer, as in the paper, and (2) SKIP. `AdditiveKernel` and `ProductKernel`, however, compute their kernel results in for loops and should not use an additional factor of d memory unless the underlying kernels return things other than tensors or non-lazy tensors.
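For concreteness, the two flavors can be constructed roughly like this (a minimal sketch; `d` is a placeholder for the number of input dimensions):

```python
from gpytorch.kernels import AdditiveKernel, AdditiveStructureKernel, RBFKernel

d = 26  # number of input dimensions (placeholder)

# Batch mode: one base kernel evaluated over all d dimensions at once,
# which materializes a d x N x N tensor (fast, but d times the memory).
structure_kernel = AdditiveStructureKernel(RBFKernel(), num_dims=d)

# For-loop mode: d separate one-dimensional kernels summed one at a time,
# which should peak near a single N x N matrix.
loop_kernel = AdditiveKernel(*[RBFKernel(active_dims=[i]) for i in range(d)])
```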
If you are running out of memory with `AdditiveKernel`, could you try running your code in a `gpytorch.settings.lazily_evaluate_kernels(False)` context and see if that solves the memory issue? In my opinion, `AdditiveKernel` (i.e., sans `Structure`) using the factor of d more memory would be a bug, not intended behavior.
Thanks for the reply. Using `gpytorch.settings.lazily_evaluate_kernels(False)` doesn't resolve the memory issue with `AdditiveKernel`. Using this setting actually seems to use more memory, as I now run out of memory with a smaller checkpoint size (and without checkpointing).
```python
subkernels = [RBFKernel(active_dims=[i]) for i in range(trainX.shape[1])]
add_kernel = ScaleKernel(AdditiveKernel(*subkernels)).to('cuda:0')

with gpytorch.beta_features.checkpoint_kernel(2500), gpytorch.settings.lazily_evaluate_kernels(False):
    train(add_kernel)  # CUDA out of memory
```
I'm not entirely sure why this is the case. One guess: since all sub-kernels are now evaluated eagerly and separately, copies of the intermediate kernel tensors may be cached during the forward pass for use in the backward pass.
Also, I forgot to mention: this is with gpytorch version 0.3.2 and pytorch version 1.1.0.
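If it helps, peak GPU usage in the two settings can be compared with PyTorch's built-in counter; a minimal sketch, reusing the `train` call from above:

```python
import torch
import gpytorch

with gpytorch.settings.lazily_evaluate_kernels(False):
    train(add_kernel)  # same training call as above
# Peak GPU memory allocated since the start of the process, in GB:
print(torch.cuda.max_memory_allocated() / 1e9)
```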
@idelbrid @jacobrgardner - I think we could add something similar to what you're proposing, where we accumulate the kernel when memory is an issue. I'm not sure exactly where it would fit within the GPyTorch architecture, though... maybe `AdditiveKernel` could have a "memory efficient" option that would turn off lazy evaluation and accumulate the kernel matrix?
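Purely as a sketch of what that option might do (this is not an existing GPyTorch API; the class name is hypothetical), the forward pass would evaluate each sub-kernel eagerly and accumulate one dense matrix:

```python
import torch
from gpytorch.kernels import Kernel

class MemoryEfficientAdditiveKernel(Kernel):
    """Hypothetical sketch of the proposed option: evaluate each sub-kernel
    eagerly and accumulate one dense matrix, so peak memory stays near one
    or two N x N blocks instead of d of them."""

    def __init__(self, *kernels):
        super().__init__()
        self.kernels = torch.nn.ModuleList(kernels)

    def forward(self, x1, x2, **params):
        res = None
        for kern in self.kernels:
            # .evaluate() forces a dense tensor instead of a lazy wrapper
            term = kern(x1, x2, **params).evaluate()
            res = term if res is None else res + term
        return res
```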
# 🚀 Feature Request
## Motivation
Hi, I'm working on using additive kernels on moderately large datasets, and I've been running into out-of-memory errors due to the way that additive kernels are handled. That is, if we have a d-dimensional dataset and an additive kernel for each dimension, we end up storing d separate N x N kernel matrices or, using `AdditiveStructureKernel`, a d x N x N tensor. In float32, 26 x 13500 x 13500 entries come to roughly 19 GB, so on a 12GB GPU this runs out of memory for the "pol" UCI dataset with N=13500, d=26.
## Pitch
In concept, a fully additive kernel shouldn't be significantly more expensive in computation or memory than a non-additive (RBF) kernel: an iterative computation should only need N x N memory instead of d x N x N. My stop-gap solution is a custom forward/backward function that uses loops instead of expanding to larger tensors; it is pasted below. Maybe you have better ideas about how it could be integrated into the codebase.
Using this function is much faster than using an `AdditiveStructureKernel` or `AdditiveKernel` if it keeps you from checkpointing the kernel. That being said, I don't think it is well optimized: if I use `cdist`, it's very slow but uses hardly any memory, whereas if I manually compute the distances, it's much faster but uses much more memory. This implementation also isn't very modular (only RBF is supported, and it isn't implemented to run in batch).
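For concreteness, the two distance computations being compared look roughly like this (illustrative; `x1_col` and `x2_col` stand for one active dimension, with shapes `(N, 1)` and `(M, 1)`):

```python
import torch

# Option 1: torch.cdist - uses very little extra memory, but was slow here.
sq_dist = torch.cdist(x1_col, x2_col).pow(2)

# Option 2: expanded quadratic form - much faster, but materializes several
# N x M intermediates at once, so it uses far more memory.
sq_dist = (x1_col.pow(2).sum(-1, keepdim=True)
           + x2_col.pow(2).sum(-1)
           - 2.0 * x1_col @ x2_col.t())
```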
Any thoughts? I'd be willing to make a PR if you think something like this would be a good idea. Any suggestions for performance improvement here would be helpful too.
## Additional context
Implementation:
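The snippet below is a simplified sketch of the approach rather than the exact code from the issue: a custom autograd `Function` that loops over dimensions and recomputes each per-dimension kernel in the backward pass instead of caching d separate N x M matrices; for simplicity, only the lengthscales are differentiable (training inputs are treated as constants, as is usual in GP regression).

```python
import torch

class LoopedAdditiveRBF(torch.autograd.Function):
    """Sketch: sum_i exp(-(x1_i - x2_i)^2 / (2 * l_i^2)), computed one
    dimension at a time so peak memory stays near N x M, not d x N x M."""

    @staticmethod
    def forward(ctx, x1, x2, lengthscales):
        # x1: (N, d), x2: (M, d), lengthscales: (d,)
        ctx.save_for_backward(x1, x2, lengthscales)
        out = torch.zeros(x1.size(0), x2.size(0), dtype=x1.dtype, device=x1.device)
        for i in range(x1.size(1)):
            diff = x1[:, i:i + 1] - x2[:, i].unsqueeze(0)  # one (N, M) block
            out += torch.exp(-0.5 * (diff / lengthscales[i]).pow(2))
        return out

    @staticmethod
    def backward(ctx, grad_output):
        x1, x2, lengthscales = ctx.saved_tensors
        grad_ls = torch.zeros_like(lengthscales)
        for i in range(x1.size(1)):
            # Recompute the per-dimension kernel rather than caching d copies.
            sq = (x1[:, i:i + 1] - x2[:, i].unsqueeze(0)).pow(2)
            k_i = torch.exp(-0.5 * sq / lengthscales[i].pow(2))
            # d/dl exp(-sq / (2 l^2)) = k * sq / l^3
            grad_ls[i] = (grad_output * k_i * sq).sum() / lengthscales[i].pow(3)
        return None, None, grad_ls  # x1, x2 treated as constants
```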
Tests:
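A minimal sanity check for a function like the sketch above (illustrative; `gradcheck` compares the hand-written backward against finite differences in double precision):

```python
import torch

x1 = torch.randn(5, 3, dtype=torch.double)
x2 = torch.randn(4, 3, dtype=torch.double)
lengthscales = torch.rand(3, dtype=torch.double).add_(0.5).requires_grad_()

# Raises if the custom backward disagrees with numerical gradients.
assert torch.autograd.gradcheck(
    lambda ls: LoopedAdditiveRBF.apply(x1, x2, ls), (lengthscales,)
)
```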