Closed: JeremyIV closed this pull request 4 months ago
Thanks for the pull request. A few thoughts/notes for myself:
I am merging the pull request. I'll add a line to the README to explain this new parameter.
One usual way of dealing with the higher-frequency Fourier terms is to add a regularization term that penalizes the high frequencies in the way you want. The merit of that approach is that the function is pushed toward smoothness as training progresses, not just at initialization.
One thing to study is probably how well the frequency profile of the noise is preserved or changed during training.
Thanks for merging! Here are some quick sloppy experiments in response to your comments:
Regularization
I tried the default initialization with L2 regularization of the Fourier coefficients, weighted by f^alpha, for alpha = 0, 0.5, 1, 1.5, 2, 2.5.
And here are the power spectra before and after training with alpha = 1.5:
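For reference, the f^alpha-weighted L2 penalty described above can be sketched in a few lines of numpy. The array layout is illustrative, not the actual FourierKAN internals:

```python
import numpy as np

def spectral_l2_penalty(fouriercoeffs, alpha=1.5):
    """f**alpha-weighted L2 penalty on Fourier coefficients.

    fouriercoeffs: array whose last axis indexes the frequency
    k = 1..gridsize (e.g. shape (2, outdim, gridsize) for the cos and
    sin coefficients). Larger alpha punishes high frequencies more,
    pushing the learned functions toward smoothness during training,
    not just at initialization.
    """
    gridsize = fouriercoeffs.shape[-1]
    freqs = np.arange(1, gridsize + 1, dtype=float)  # f = 1, 2, ..., gridsize
    weights = freqs ** alpha                         # f**alpha
    return np.sum(weights * fouriercoeffs**2)

# Example: white-noise coefficients over 8 frequencies.
rng = np.random.default_rng(0)
coeffs = rng.normal(size=(2, 3, 8))
penalty = spectral_l2_penalty(coeffs, alpha=1.5)
```

In training this would simply be added to the task loss with some weight (`loss = task_loss + lam * penalty`).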
Thanks a lot for doing some experiments.
In the KAN paper, they mention running their experiments with LBFGS, hinting at a second-order method.
FourierKAN uses cos and sin (C∞ functions), so it can probably benefit from a second-order optimizer that takes advantage of the curvature.
Something like Hessian-free optimization (e.g. https://github.com/fmeirinhos/pytorch-hessianfree, with the author's warning "Not fully tested, use with caution!") should do the trick, and help separate optimization issues from model expressiveness.
Standard neural-network architecture tricks like residual connections and normalization should also help.
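As a toy illustration of why quasi-Newton methods do well here (this is not the Hessian-free repo linked above, just scipy's standard L-BFGS-B on a small Fourier least-squares fit), a C∞ Fourier model gives a smooth loss surface that a second-order method solves essentially to machine precision:

```python
import numpy as np
from scipy.optimize import minimize

# Fit a tiny 1-D Fourier series y(x) = sum_k a_k cos(kx) + b_k sin(kx)
# to a smooth target. The loss is quadratic in the coefficients, so a
# quasi-Newton optimizer like L-BFGS exploits the curvature and
# converges in a handful of iterations.
K = 4
x = np.linspace(-np.pi, np.pi, 64)
target = np.sin(2 * x) + 0.5 * np.cos(3 * x)   # exactly representable

def loss(theta):
    a, b = theta[:K], theta[K:]
    k = np.arange(1, K + 1)[:, None]           # shape (K, 1)
    pred = a @ np.cos(k * x) + b @ np.sin(k * x)
    return np.mean((pred - target) ** 2)

res = minimize(loss, np.zeros(2 * K), method="L-BFGS-B")
# res.x recovers b_2 ≈ 1 and a_3 ≈ 0.5, with loss near zero.
```

A first-order optimizer on the same problem would need far more iterations to reach comparable accuracy, which is the point of trying second-order methods before blaming model expressiveness.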
With the default initialization scheme for `fouriercoeffs`, all frequencies draw their coefficients from the same distribution. This means that as `gridsize` becomes large, there is more and more contribution from the high frequencies, making the KAN's initial scalar functions very high-frequency. In these high-frequency functions, the output values for nearby inputs are uncorrelated, so the initial KAN function is highly "scrambled" and cannot "unscramble" itself during training.

For example, here is a KAN with 3 layers, 10 hidden units, and a grid size of 120, trained to encode an image using the coordinate-network paradigm (see e.g. SIREN):
Target image:

With the default initialization:
Before training:
After training:

With smooth initialization:
Before training:
After training:
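The smooth-initialization idea can be sketched as follows. The 1/f^decay scaling and the sqrt(inputdim) normalization are illustrative assumptions; the exact scheme in the merged PR may differ:

```python
import numpy as np

def smooth_init(inputdim, outdim, gridsize, decay=1.0, rng=None):
    """Draw Fourier coefficients whose std decays with frequency.

    The default scheme draws every frequency from the same
    distribution, so the initial scalar functions look like white
    noise as gridsize grows. Scaling the std of frequency k by
    1 / k**decay gives a 1/f-type power spectrum instead, so the
    initial functions are smooth and nearby inputs stay correlated.
    (Illustrative sketch, not necessarily the PR's exact normalization.)
    """
    rng = np.random.default_rng(rng)
    k = np.arange(1, gridsize + 1)
    scale = 1.0 / (k ** decay * np.sqrt(inputdim))  # broadcasts over last axis
    # Axis 0 holds the cos/sin pair; last axis indexes frequency.
    return rng.normal(size=(2, outdim, inputdim, gridsize)) * scale

coeffs = smooth_init(inputdim=10, outdim=10, gridsize=120)
```

With this scaling, growing `gridsize` adds progressively fainter high-frequency terms instead of dominating the initial function.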