SynodicMonth / ChebyKAN

Kolmogorov-Arnold Networks (KAN) using Chebyshev polynomials instead of B-splines.

Kolmogorov-Arnold Transformer #12

Open · Adamdad opened this issue 4 weeks ago

Adamdad commented 4 weeks ago

KAN was strong but faced scalability issues. We tackled this with 3 simple tricks. By combining KAN with Transformers, we've built a much stronger and more scalable model. 💪

📄 Paper: https://arxiv.org/abs/2409.10594 💻 Code: https://github.com/Adamdad/kat
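
For readers skimming the thread, the rational (Padé-style) activation discussed below can be sketched in a few lines of PyTorch. This is only an illustrative sketch of the idea (degree-5 numerator over a degree-4 "safe" denominator); the coefficient layout and initialization are assumptions, not the exact implementation from the KAT repository.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Rational (Pade-style) activation y = P(x) / Q(x).

    Sketch only: Q(x) = 1 + |b_1 x + ... + b_n x^n| keeps the denominator >= 1,
    so the output stays well behaved for large inputs.
    """

    def __init__(self, num_degree: int = 5, den_degree: int = 4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(num_degree + 1) * 0.1)  # a_0 .. a_m
        self.b = nn.Parameter(torch.randn(den_degree) * 0.1)      # b_1 .. b_n

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Horner evaluation of P(x) = a_0 + a_1 x + ... + a_m x^m
        p = torch.zeros_like(x)
        for coeff in torch.flip(self.a, dims=[0]):
            p = p * x + coeff
        # Horner evaluation of b_1 x + ... + b_n x^n (no constant term)
        q = torch.zeros_like(x)
        for coeff in torch.flip(self.b, dims=[0]):
            q = (q + coeff) * x
        return p / (1.0 + q.abs())

x = torch.randn(8, 16)
print(RationalActivation()(x).shape)  # torch.Size([8, 16])
```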

K-H-Ismail commented 3 weeks ago

Great work! I am trying to reproduce your results right now. The init-as-GELU-or-ReLU trick was very well thought out! The Padé approximant was also a promising trail to follow. I am investigating whether other polynomial activations could work with trick number 3 (activation initialization).
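
One plausible way to realize the "init as GELU" idea is a linearized least-squares fit of the rational coefficients on a grid. This is a sketch under that assumption, not the exact procedure from the KAT paper; `fit_rational_init` is a made-up helper name.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, good enough for an initialization fit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fit_rational_init(f, num_degree=5, den_degree=4, lo=-3.0, hi=3.0, n=2000):
    """Fit P(x)/Q(x) ~ f(x) with Q(x) = 1 + b_1 x + ... + b_n x^n on a grid.

    Classical linearized trick: solve P(x) - f(x)*(b_1 x + ... + b_n x^n) ~ f(x),
    which is linear in the coefficients (a, b).
    """
    x = np.linspace(lo, hi, n)
    y = f(x)
    A_num = np.vander(x, num_degree + 1, increasing=True)                        # [1, x, ..., x^m]
    A_den = -y[:, None] * np.vander(x, den_degree + 1, increasing=True)[:, 1:]   # [-y x, ..., -y x^n]
    coeffs, *_ = np.linalg.lstsq(np.hstack([A_num, A_den]), y, rcond=None)
    return coeffs[:num_degree + 1], coeffs[num_degree + 1:]                      # a_0..a_m, b_1..b_n

a, b = fit_rational_init(gelu)
xs = np.linspace(-3.0, 3.0, 7)
approx = np.polyval(a[::-1], xs) / (1.0 + np.polyval(np.concatenate(([0.0], b))[::-1], xs))
print(np.max(np.abs(approx - gelu(xs))))  # small error on the fit interval
```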

Adamdad commented 3 weeks ago

Hello @K-H-Ismail, other polynomial activations could be a potential direction to explore. But I think the problem is that with a rational function, the denominator serves as a certain kind of normalization. If we used plain polynomial functions instead, we would need to apply additional normalization layers in the network.
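
A quick numeric illustration of this "denominator as normalization" point, with arbitrary random coefficients (not taken from KAT): a plain degree-5 polynomial blows up like x^5, while the (5, 4) rational only grows like x.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(scale=0.1, size=6)   # numerator coefficients a_0 .. a_5
b = rng.normal(scale=0.1, size=4)   # denominator coefficients b_1 .. b_4

x = np.array([1.0, 10.0, 100.0])
numerator = np.polyval(a[::-1], x)
denominator = 1.0 + np.abs(np.polyval(np.concatenate(([0.0], b))[::-1], x))

print("plain polynomial:", numerator)                 # grows roughly like x^5
print("rational (5/4):  ", numerator / denominator)   # grows roughly like x^(5-4) = x
```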

K-H-Ismail commented 3 weeks ago

Hello @Adamdad, yes, exactly. Actually, there is a lot of literature proving mathematically that the universal approximation property is equivalent to the activation being non-polynomial. However, I found no theory for polynomials initialized as ReLU, nor for rational functions.

PS: do you accept pull requests on https://github.com/Adamdad/kat ?

Adamdad commented 3 weeks ago

Of course, you are welcome to send a pull request 😁

K-H-Ismail commented 3 weeks ago

@Adamdad. In your paper, you used a polynomial of degree 5 in the numerator and a polynomial of degree 4 in the denominator. The degree of the rational fraction is then 5 − 4 = 1, so for large inputs it behaves like a polynomial of degree 1, which is homogeneous with ReLU. The degree of its gradient is 1 − 1 = 0, which is homogeneous with a constant. I think this is no coincidence.

Other than that, if the degree of the numerator is much larger than the degree of the denominator, the activation will suffer from exploding gradients, just like plain polynomials, since the limit at infinity is infinite. If the degree of the numerator is much smaller than the degree of the denominator, the limit at infinity is 0, which might lead to vanishing gradients.

Finally, I think the degrees of the polynomials should be chosen such that deg(numerator) − deg(denominator) = 1.
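
A rough numerical check of this asymptotic argument, again with arbitrary random coefficients rather than trained KAT ones: for a degree gap of +1 the finite-difference gradient settles to a constant, for +3 it keeps growing, and for −2 it decays toward zero.

```python
import numpy as np

def rational(x, a, b):
    """P(x) / (1 + |b_1 x + ... + b_n x^n|), coefficients in increasing order."""
    num = np.polyval(a[::-1], x)
    den = 1.0 + np.abs(np.polyval(np.concatenate(([0.0], b))[::-1], x))
    return num / den

def grad(x, a, b, eps=1e-3):
    # central finite difference, enough for an order-of-magnitude check
    return (rational(x + eps, a, b) - rational(x - eps, a, b)) / (2.0 * eps)

rng = np.random.default_rng(1)
x = np.array([10.0, 100.0, 1000.0])
for m, n in [(5, 4), (5, 2), (3, 5)]:          # degree gaps +1, +3, -2
    a = rng.normal(scale=0.1, size=m + 1)      # numerator a_0 .. a_m
    b = rng.normal(scale=0.1, size=n)          # denominator b_1 .. b_n
    print(f"deg gap {m - n:+d}: grad at x=10, 100, 1000 ->", grad(x, a, b))
```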