MLP parameter count scales as O(N^2 L), while KAN scales as O(N^2 L G), where N is the number of neurons in adjacent layers (layer and layer+1), L is the number of layers, and G is the grid size.
And we can scale KAN in grid size: the paper claims the loss scales as G^-4 (which sounds a little too good), but as far as I know this was only tested on small datasets. So, in theory, we can get higher accuracy with a smaller model. I am writing a CUDA implementation, and I see that memory usage can be comparable to an MLP's.
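A rough back-of-the-envelope sketch of where the extra G factor comes from (the helper names `mlp_params`/`kan_params` are hypothetical, and the "(G + k) coefficients per edge" count is an approximation, not the exact bookkeeping of the reference implementation):

```python
def mlp_params(widths):
    """Weights + biases for a plain MLP with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

def kan_params(widths, grid_size, spline_order=3):
    """Same widths, but every edge carries its own spline activation with
    roughly (grid_size + spline_order) learnable coefficients."""
    coeffs_per_edge = grid_size + spline_order
    return sum(n_in * n_out * coeffs_per_edge
               for n_in, n_out in zip(widths, widths[1:]))

if __name__ == "__main__":
    widths = [64, 64, 64, 1]   # toy example
    G = 5
    print("MLP params:", mlp_params(widths))     # ~ O(N^2 L)
    print("KAN params:", kan_params(widths, G))  # ~ O(N^2 L G)
```

The point of the sketch: for the same layer widths, the KAN count is the MLP count multiplied by roughly the number of spline coefficients per edge, which is where the O(N^2 L G) scaling comes from.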
The original implementation also includes roughly G times more parameters. Don't be fooled by the overly complicated code.
I am confused about the principle of KAN. From this implementation, it looks like KAN has more learnable parameters. It seems that KAN's improvement comes from the learnable activation functions, which yield better accuracy. Does KAN have any advantage in computation or memory?