Open · JanRocketMan opened 6 months ago
You're right.
Really appreciate your suggestion.
Point 2 is a great question. As I understand it:
We have two separate choices: which nonlinearity to use, and whether we apply it on edges and then sum into nodes, or directly on nodes.
If the nonlinearity is Cheby, then it doesn't matter whether we apply it on nodes or edges: we can always fuse the rest of the operations into a single nn.Linear.
If the nonlinearity is grid-based (like the B-splines in KAN, or smoothing splines), then with activations on edges we can't easily fuse the computation into an nn.Linear, because each edge has a different basis. In principle we can expand the input to this larger basis set, but that gets very expensive (naively (degree + 1) * out_channels times larger). Maybe it's possible to fuse this expansion with the subsequent Linear in a single op, but that would require writing custom CUDA kernels a la FlashAttention.
On the more positive side, maybe we can share the grid across different output channels (I believe efficient-kan does this) and that would be enough. But if we don't use grids at all, I feel like it wouldn't make any difference compared to GLUs. Maybe I'm wrong.
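To make the grid point concrete, here's a rough sketch (using a piecewise-linear "hat" basis as a stand-in for real B-splines; all names and shapes are mine): with per-edge grids the basis depends on the output index, so there's no single matmul to fuse into, while a shared grid collapses back into one nn.Linear.

```python
import torch
import torch.nn as nn

batch, in_f, out_f, G = 32, 16, 8, 10
x = torch.rand(batch, in_f) * 2 - 1                        # inputs assumed in [-1, 1]
h = 2.0 / (G - 1)                                          # knot spacing

# Per-edge grids: every (out, in) pair has its own knots, so the basis depends
# on the output index and there is no single nn.Linear to fuse into.
edge_grid = torch.linspace(-1, 1, G) + 0.01 * torch.randn(out_f, in_f, G)
edge_coef = torch.randn(out_f, in_f, G)
edge_basis = torch.relu(1 - (x[:, None, :, None] - edge_grid).abs() / h)  # (batch, out, in, G)
y_edge = torch.einsum('boig,oig->bo', edge_basis, edge_coef)

# Shared grid per input feature (what I believe efficient-kan does): the expansion
# no longer depends on the output index, so all coefficients fold into one Linear.
shared_grid = torch.linspace(-1, 1, G)
shared_basis = torch.relu(1 - (x[..., None] - shared_grid).abs() / h)     # (batch, in, G)
fused = nn.Linear(in_f * G, out_f, bias=False)
y_shared = fused(shared_basis.flatten(1))                                  # one big matmul
```

The per-edge version materializes a (batch, out, in, G) tensor, which is where the naive out_channels blow-up comes from.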
Indeed, cheby/fourier/legendre/hermite/laguerre are all the same here: they're all equivalent to a custom activation + nn.Linear. That's why KAN uses grid-based splines. Really impressive.
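E.g. a Fourier version of degree k is literally just a fixed feature map followed by nn.Linear (quick sketch, names are mine):

```python
import torch
import torch.nn as nn

def fourier_features(x, k=4):
    # x: (batch, in_features) -> (batch, in_features * 2k), a fixed (non-learnable) map
    freqs = torch.arange(1, k + 1, device=x.device, dtype=x.dtype)
    angles = x.unsqueeze(-1) * freqs                          # (batch, in, k)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

in_f, out_f, k = 16, 8, 4
linear = nn.Linear(in_f * 2 * k, out_f)                       # holds all per-edge Fourier coefficients
y = linear(fourier_features(torch.randn(2, in_f), k))         # same "activation + Linear" pattern
```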
Hi, very interesting idea, kudos!
I believe the proposed layer is equivalent to the following combination (I fix the degree to 4 for simplicity):
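Something along these lines (a minimal PyTorch sketch of what I mean; ChebyActivation / ChebyLAN are placeholder names, not from this repo):

```python
import torch
import torch.nn as nn

class ChebyActivation(nn.Module):
    """Expand each input feature into Chebyshev basis values T_0 .. T_degree."""
    def __init__(self, degree=4):
        super().__init__()
        self.degree = degree

    def forward(self, x):                    # x: (batch, in_features), assumed in [-1, 1]
        basis = [torch.ones_like(x), x]
        for _ in range(2, self.degree + 1):
            basis.append(2 * x * basis[-1] - basis[-2])   # T_n = 2 x T_{n-1} - T_{n-2}
        return torch.cat(basis, dim=-1)      # (batch, in_features * (degree + 1))

class ChebyLAN(nn.Module):
    """Fixed Chebyshev activation on nodes, then one big learnable nn.Linear."""
    def __init__(self, in_features, out_features, degree=4):
        super().__init__()
        self.act = ChebyActivation(degree)
        self.linear = nn.Linear(in_features * (degree + 1), out_features, bias=False)

    def forward(self, x):
        return self.linear(self.act(x))
```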
This makes it a variant of a LAN network (see App. B2 in the KAN paper), which is nice, but it's a double-edged sword.
On one side, with this rewrite you can train it pretty efficiently (by checkpointing the ChebyActivation function and using the optimized CUDA Linear kernel; see the checkpointing sketch below).
On the other side, modern networks like Llama 3 already use Gated Linear Unit activations, which should give roughly equivalent representational power (I'm not 100% sure on this point, though).
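For the checkpointing point, I mean something like this (a sketch built on the ChebyLAN rewrite above; the intermediate Chebyshev terms are recomputed in the backward pass instead of being stored):

```python
from torch.utils.checkpoint import checkpoint

def cheby_lan_forward(layer, x):
    # layer is a ChebyLAN from the sketch above.
    # Recompute the cheap Chebyshev recurrence during backward instead of
    # keeping its intermediate tensors in memory.
    expanded = checkpoint(layer.act, x, use_reentrant=False)
    return layer.linear(expanded)   # dispatches to the optimized cuBLAS GEMM kernel
```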
Do you think this reasoning is correct, or am I missing something?
Thanks in advance!