General kernel models are predictors of the form $$f(x)=\sum_{i=1}^p \alpha_i K(x,z_i)$$ where $z_i$ are model centers. EigenPro3 requires only $O(p)$ memory, and can use multiple GPUs. Here $K$ can be any positive semidefinite kernel, and $z_i$ can be arbitrary, i.e., they need not be a subset of the training data.
The algorithm is based on Projected dual-preconditioned Stochastic Gradient Descent. A derivation is given in this paper
Title: Toward Large Kernel Models (2023)
Authors: Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit.
TL;DR: Recent studies indicate kernel machines can perform similarly or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by their equivalence to wide neural networks. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines the model size is tied to the data size. EigenPro3 provides a way forward for constructing large-scale general kernel models that decouple the model and training data.
pip install git+https://github.com/EigenPro/EigenPro3.git
Tested on:
Set an environment variable export DATA_DIR='/path/to/dataset/'
where cifar-10-batches-py
can be downloaded.
from torchvision.datasets import CIFAR10
import os
CIFAR10(os.environ['DATA_DIR'], train=True, download=True)
import torch
from eigenpro3.utils import accuracy, load_dataset
from eigenpro3.datasets import CustomDataset
from eigenpro3.models import KernelModel
from eigenpro3.kernels import laplacian, ntk_relu
p = 5000 # model size
if torch.cuda.is_available():
DEVICES = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
else:
DEVICES = [torch.device('cpu')]
kernel_fn = lambda x, z: laplacian(x, z, bandwidth=20.0)
# kernel_fn = lambda x, z: ntk_relu(x, z, depth=2)
n_classes, (X_train, y_train), (X_test, y_test) = load_dataset('cifar10')
centers = X_train[torch.randperm(X_train.shape[0])[:p]]
testloader = torch.utils.data.DataLoader(
CustomDataset(X_test, y_test.argmax(-1)), batch_size=512,
shuffle=False, pin_memory=True)
model = KernelModel(n_classes, centers, kernel_fn, X=X_train, y=y_train, devices=DEVICES)
model.fit(model.train_loaders, testloader, score_fn=accuracy, epochs=20)
If you want to train your kernel model 1 batch at a time time, you can use your own dataloader and call the fit_batch
method for the model
object.
Refer to the demo FashionMNIST_batched.ipynb where you can use your own dataloader. A similar method fit_epoch
is also provided.
EigenPro2 can only train models of the form $$f(x)=\sum_{i=1}^n \alpha_i K(x,x_i)$$ where $x_i$ are $n$ training samples. It requires $O(n)$ memory and $O(n^2)$ FLOPS per epoch. This does not scale well for large $n$ (billion-scale datasets).
EigenPro3 trains a smaller model of size $p$, requires $O(p)$ memory and $O(np + p^2)$ FLOPS per epoch. This allows scaling to large $n$ (billion-scale), as well as large $p$ (millions of centers $z_i$).
EigenPro3 solves the optimization problem, $$\underset{f\in\mathcal{H}}{\text{argmin}}\quad \sum_{i=1}^n (f(x_i)-yi)^2 \quad \text{subject to}\quad f(x)=\sum{i=1}^p \alpha_i K(x,z_i)\qquad\forall x$$
It applies a dual preconditioner, one for the model and one for the data. It applies a projected-preconditioned SGD $$f^{t+1}=\textrm{proj}_C(f^t - \eta\mathcal{P}(\nabla L(f^t)))$$ where $\nabla L$ is a Fréchet derivative, $\mathcal{P}$ is a preconditioner, and $\textrm{proj}_C$ is a projection operator onto the model space.