KAN-GPT-2

Training small GPT-2 style models using KANs instead of MLPs in JAX

This repository compares transformers built with standard multilayer perceptron (MLP) layers against transformers built with Kolmogorov–Arnold network (KAN) layers; a minimal sketch of the two layer types follows below.
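The repo's actual layer code isn't reproduced here; this is a minimal JAX sketch of the two blocks being compared, assuming a Gaussian-RBF parametrization of the per-edge functions (the original KAN paper uses B-splines, and this repo's parametrization may differ). All names (`init_mlp`, `mlp_block`, `init_kan`, `kan_layer`) and dimensions are illustrative.

```python
import jax
import jax.numpy as jnp

def init_mlp(key, d_model, d_hidden):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (d_model, d_hidden)) * 0.02,
        "b1": jnp.zeros(d_hidden),
        "w2": jax.random.normal(k2, (d_hidden, d_model)) * 0.02,
        "b2": jnp.zeros(d_model),
    }

def mlp_block(params, x):
    # Standard GPT-2 feed-forward: linear -> GELU -> linear.
    h = jax.nn.gelu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def init_kan(key, d_in, d_out, n_basis=8):
    # One coefficient per (input dim, output dim, basis function) edge.
    return {"coef": jax.random.normal(key, (d_in, d_out, n_basis)) * 0.02}

def kan_layer(params, x, grid_min=-2.0, grid_max=2.0):
    # KAN idea: every edge (i -> j) carries its own learnable scalar
    # function phi_ij. Here each phi is a weighted sum of fixed Gaussian
    # bumps spread over [grid_min, grid_max] (an assumed parametrization).
    n_basis = params["coef"].shape[-1]
    centers = jnp.linspace(grid_min, grid_max, n_basis)
    basis = jnp.exp(-((x[..., None] - centers) ** 2))  # (..., d_in, n_basis)
    # y_j = sum_i phi_ij(x_i) = sum_i sum_k c_ijk * B_k(x_i)
    return jnp.einsum("...ik,iok->...o", basis, params["coef"])

# Both map (batch, seq, d_model) -> (batch, seq, d_model), so either can
# slot into the transformer block after attention.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (2, 16, 64))
y_mlp = mlp_block(init_mlp(key, 64, 256), x)
y_kan = kan_layer(init_kan(key, 64, 64), x)
```

The structural difference: the MLP learns fixed linear maps with a fixed nonlinearity between them, while the KAN layer learns the nonlinearity itself on every edge.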

Key points:

Results:

Both models reach a final loss of ~2.46 (despite the KAN model having 25% fewer parameters!). A back-of-envelope parameter count follows below.

[figure: training loss curves for the two models]
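To illustrate where a parameter gap of roughly this size can come from (the width `d = 256` and basis size below are assumptions, not the repo's configuration): a GPT-2 feed-forward block costs about 8·d² parameters, while a KAN layer with `n_basis` coefficients per edge costs d²·n_basis, so `n_basis = 6` lands near 25% fewer per block.

```python
d = 256  # illustrative width; not the repo's actual setting

# GPT-2 MLP block: d -> 4d -> d, two weight matrices plus biases.
mlp_params = (d * 4 * d + 4 * d) + (4 * d * d + d)  # = 8*d^2 + 5*d

# KAN layer replacing it: one coefficient per (in, out, basis) triple.
n_basis = 6  # assumed; the repo's basis size isn't stated here
kan_params = d * d * n_basis                        # = 6*d^2

print(kan_params / mlp_params)  # ~0.75, i.e. ~25% fewer per block
```

Whether this matches the repo's overall 25% figure depends on its actual basis size and on the unchanged attention and embedding parameters; it is only an illustration of the trade-off.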

Hyperparameters:

Hardware: a single NVIDIA GTX 1080 Ti GPU

Weights & Biases run: link.