KindXiaoming / pykan

Kolmogorov Arnold Networks

`nan` after `auto_symbolic` and training #89

Closed AmitMY closed 1 week ago

AmitMY commented 1 week ago

I train a KAN using 19 inputs, 5 hidden neurons, and 1 output (I know that I only need a subset of these inputs, and I was hoping the KAN would tell me which).

I train while refining the grid through sizes 5, 10, 20, and 50, with sub-optimal results:

train loss: 6.33e+01 | test loss: 6.56e+01 | reg: 2.68e+02 : 100%|██| 50/50 [00:56<00:00,  1.13s/it]
train loss: 6.23e+01 | test loss: 6.81e+01 | reg: 2.95e+02 : 100%|██| 50/50 [01:03<00:00,  1.26s/it]
train loss: 6.14e+01 | test loss: 7.05e+01 | reg: 3.03e+02 : 100%|██| 50/50 [01:27<00:00,  1.76s/it]
train loss: 6.59e+01 | test loss: 2.32e+02 | reg: 6.24e+02 : 100%|██| 50/50 [01:55<00:00,  2.31s/it]
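
For reference, a minimal sketch of the setup described above, assuming the grid-refinement pattern from the pykan tutorials (KAN(...).initialize_from_another_model(...)) and that dataset is the usual pykan dict with 'train_input'/'train_label' entries; the hyperparameters are placeholders:

from kan import KAN

# 19 inputs -> 5 hidden neurons -> 1 output, starting at grid size 5
model = KAN(width=[19, 5, 1], grid=5, k=3)
model.train(dataset, opt="LBFGS", steps=50)  # dataset dict assumed as above

for grid_size in [10, 20, 50]:
    # refine the grid, transferring the trained splines to the finer model
    model = KAN(width=[19, 5, 1], grid=grid_size, k=3).initialize_from_another_model(
        model, dataset['train_input'])
    model.train(dataset, opt="LBFGS", steps=50)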

I run auto_symbolic as shown in the tutorial, and get all of the nodes "fixed":

fixing (0,0,0) with x^2, r2=0.9999875381797473
fixing (0,0,1) with log, r2=0.9995437326827541
fixing (0,1,0) with tanh, r2=0.9190223510394673
fixing (0,1,1) with abs, r2=0.591284704723076
fixing (0,2,0) with tanh, r2=0.9859092016712654
fixing (0,2,1) with tanh, r2=0.2644378732312754
fixing (0,3,0) with abs, r2=0.8094657967145165
fixing (0,3,1) with sin, r2=0.20524626203198593
fixing (0,4,0) with sin, r2=0.8596364823055564
fixing (0,4,1) with abs, r2=0.7073148948571956
fixing (0,5,0) with sin, r2=0.8451701555686204
fixing (0,5,1) with tanh, r2=0.12906659532236966
fixing (0,6,0) with sin, r2=0.5258884938324261
fixing (0,6,1) with sin, r2=0.4188728101754649
fixing (0,7,0) with sin, r2=0.9013293203443674
fixing (0,7,1) with tanh, r2=0.7815785647587229
fixing (0,8,0) with tanh, r2=0.9598249300435913
fixing (0,8,1) with sin, r2=0.874354100172932
fixing (0,9,0) with sin, r2=0.8688442004450128
fixing (0,9,1) with sin, r2=0.8079476861665728
fixing (0,10,0) with sin, r2=0.511931841317213
fixing (0,10,1) with abs, r2=0.5395735955860584
fixing (0,11,0) with abs, r2=0.8789512062476048
fixing (0,11,1) with sin, r2=0.9865673967925179
fixing (0,12,0) with x^2, r2=0.9392742452969135
fixing (0,12,1) with sin, r2=0.7497273750201916
fixing (0,13,0) with sin, r2=0.8426054543213266
fixing (0,13,1) with tanh, r2=0.27823616049107963
fixing (0,14,0) with tanh, r2=0.9149009145289683
fixing (0,14,1) with sin, r2=0.8631999116146144
fixing (0,15,0) with sin, r2=0.40641405950097254
fixing (0,15,1) with sin, r2=0.7259583807317198
fixing (0,16,0) with tanh, r2=0.3861879500950182
fixing (0,16,1) with tanh, r2=0.7091527502138572
fixing (0,17,0) with x^2, r2=0.7360096867746375
fixing (0,17,1) with sin, r2=0.9544954283288226
fixing (0,18,0) with sin, r2=0.8808563153924848
fixing (0,18,1) with sin, r2=0.8349780876453031
fixing (1,0,0) with tanh, r2=0.9440596493618962
fixing (1,1,0) with tanh, r2=0.5857742023383322
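
(For reference, the log above is produced by a call along the lines of the following, presumably with the default symbolic library:)

model.auto_symbolic()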

But then when training again, I immediately get nan:

train loss: nan | test loss: nan | reg: nan :  10%|█▊                | 5/50 [00:08<01:15,  1.67s/it]

Intel MKL ERROR: Parameter 6 was incorrect on entry to DGELSY.

Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign-language-processing/mediapipe-hand-crop-fix/mediapipe_crop_estimate/train_kan.py", line 90, in <module>
    model.train(dataset, opt="LBFGS", steps=50)
  File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/kan/KAN.py", line 913, in train
    self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
  File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/kan/KAN.py", line 243, in update_grid_from_samples
    self.act_fun[l].update_grid_from_samples(self.acts[l])
  File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/kan/KANLayer.py", line 220, in update_grid_from_samples
    self.coef.data = curve2coef(x_pos, y_eval, self.grid, self.k)
  File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/kan/spline.py", line 136, in curve2coef
    coef = torch.linalg.lstsq(mat.to('cpu'), y_eval.unsqueeze(dim=2).to('cpu')).solution[:,:,0] # sometimes 'cuda' version may diverge
RuntimeError: false INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp":1539, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library.

If I only train once, with grid size 5, it does train even after auto_symbolic.

yIJunWangg commented 1 week ago

I'm having the same problem bro. QAQ

KindXiaoming commented 1 week ago

Retraining after symbolic regression can indeed be tricky sometimes, because of this line:

fixing (0,0,1) with log, r2=0.9995437326827541

The logarithm is not defined if its input is <= 0. Sometimes switching from LBFGS to Adam helps; this can be done with model.train(opt='Adam'). I mentioned this problem in the examples, and it is unclear to me how to fix this in general.
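
(To make the failure mode concrete: if, during retraining, an upstream activation pushes the input of the fixed log edge to a non-positive value, the forward pass already produces nan. A minimal illustration in plain PyTorch:)

import torch

torch.log(torch.tensor(-1.0))  # tensor(nan)
torch.log(torch.tensor(0.0))   # tensor(-inf)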

However, if you don't have strong reasons to keep those potentially singular functions, there are two hacky ways to fix this:

(1) Inspect and replace the functions that can lead to singularity (log, sqrt, x^-1, etc.) via model.fix_symbolic(0,0,1,f), where f can be any function from the top of the list returned by model.suggest_symbolic(0,0,1). For example, if you find that x^2 also fits (0,0,1) quite well, you may run model.fix_symbolic(0,0,1,'x^2'); retraining should then not have a problem, because x^2 is not singular (a sketch of this follows the snippet below).

(2) Remove singular functions from your symbolic library in the first place, e.g., confine your symbolic formulas to sin, squared, and exp:

lib = ['sin', 'x^2', 'exp']
model.auto_symbolic(lib=lib)
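
A minimal sketch of approach (1), using the suggest_symbolic/fix_symbolic calls mentioned above (the edge (0,0,1) and the retraining arguments are just the example from this thread):

# rank candidate symbolic functions for the edge (layer 0, input 0, output 1)
model.suggest_symbolic(0, 0, 1)

# pin that edge to a non-singular candidate instead of log
model.fix_symbolic(0, 0, 1, 'x^2')

# retrain; x^2 has no singularity, so the nan should not reappear
model.train(dataset, opt='LBFGS', steps=50)
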
e-tuanzi commented 1 week ago

I had the same problem when I tried to run hellokan.ipynb, but it worked with the optimizer set to Adam: model.train(dataset, opt="Adam", steps=50)

yi1z commented 1 week ago

I think the nan values come from overflow after a few iterations under some setups. For example, if a large k is used, the following lines report an illegal value:

c:\...\pykan\venv\Lib\site-packages\kan\KANLayer.py:220, in KANLayer.update_grid_from_samples(self, x)
    218 grid_uniform = torch.cat([grid_adaptive[:, [0]] - margin + (grid_adaptive[:, [-1]] - grid_adaptive[:, [0]] + 2 * margin) * a for a in np.linspace(0, 1, num=self.grid.shape[1])], dim=1)
    219 self.grid.data = self.grid_eps * grid_uniform + (1 - self.grid_eps) * grid_adaptive
--> 220 self.coef.data = curve2coef(x_pos, y_eval, self.grid, self.k, device=self.device)
...
    134 mat = B_batch(x_eval, grid, k, device=device).permute(0, 2, 1)
--> 135 coef = torch.linalg.lstsq(mat.to('cpu'), y_eval.unsqueeze(dim=2).to('cpu')).solution[:, :, 0]  # sometimes 'cuda' version may diverge
    136 return coef.to(device)

Not sure how this can occur though; there is way too much code to go through. xd
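
A hedged guess at a minimal reproduction of that failure: once nan values reach the spline matrix, torch.linalg.lstsq hands them to LAPACK's gelsy driver on CPU, which can reject them with the "illegal value" parameter error seen in the tracebacks above (whether it errors or silently returns nan may depend on the BLAS/LAPACK backend, e.g. Intel MKL):

import torch

# batch of least-squares problems, shaped like the spline fit in curve2coef
mat = torch.randn(2, 5, 3)
y = torch.randn(2, 5, 1)
mat[0, 0, 0] = float('nan')  # simulate an overflowed/diverged spline matrix

# on an MKL-backed CPU build this can fail with the DGELSY parameter error above
coef = torch.linalg.lstsq(mat, y).solution[:, :, 0]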