KindXiaoming / pykan

Kolmogorov Arnold Networks

Runtime Error in hellokan.ipynb #117

Closed e-tuanzi closed 5 months ago

e-tuanzi commented 5 months ago

When I run hellokan.ipynb, a runtime error occurs due to NaN. I didn't change any settings, just ran the notebook as-is; I repeated this 10 times and still got the runtime error.

fixing (0,0,0) with sin, r2=0.9999399781227112
fixing (0,0,1) with sin, r2=0.9857487678527832
fixing (0,1,0) with x^2, r2=0.9999935626983643
fixing (0,1,1) with tanh, r2=0.9972401857376099
fixing (1,0,0) with exp, r2=0.9999949932098389
fixing (1,1,0) with exp, r2=0.8201351165771484

My run fixes these functions every time, and I don't know why, since hellokan.ipynb shows only the following three functions being found:

fixing (0,0,0) with sin, r2=0.999987252534279
fixing (0,1,0) with x^2, r2=0.9999996536741071
fixing (1,0,0) with exp, r2=0.9999988529417926

After that, a NaN runtime error occurred during the final training step.

train loss: nan | test loss: nan | reg: nan : 10%|█▊ | 5/50 [00:00<00:07, 5.86it/s]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[11], line 1
----> 1 model.train(dataset, opt="LBFGS", steps=50)

File ...\pykan\kan\KAN.py:898, in KAN.train(self, dataset, opt, steps, log, lamb, lamb_l1, lamb_entropy, lamb_coef, lamb_coefdiff, update_grid, grid_update_num, loss_fn, lr, stop_grid_update_step, batch, small_mag_threshold, small_reg_factor, metrics, sglr_avoid, save_fig, in_vars, out_vars, beta, save_fig_freq, img_folder, device)
    895 test_id = np.random.choice(dataset['test_input'].shape[0], batch_size_test, replace=False)
    897 if _ % grid_update_freq == 0 and _ < stop_grid_update_step and update_grid:
--> 898     self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
    900 if opt == "LBFGS":
    901     optimizer.step(closure)

File ...\pykan\kan\KAN.py:244, in KAN.update_grid_from_samples(self, x)
    242 for l in range(self.depth):
    243     self.forward(x)
--> 244     self.act_fun[l].update_grid_from_samples(self.acts[l])

File ...\pykan\kan\KANLayer.py:218, in KANLayer.update_grid_from_samples(self, x)
    216 grid_uniform = torch.cat([grid_adaptive[:, [0]] - margin + (grid_adaptive[:, [-1]] - grid_adaptive[:, [0]] + 2 * margin) * a for a in np.linspace(0, 1, num=self.grid.shape[1])], dim=1)
    217 self.grid.data = self.grid_eps * grid_uniform + (1 - self.grid_eps) * grid_adaptive
--> 218 self.coef.data = curve2coef(x_pos, y_eval, self.grid, self.k, device=self.device)

File ...\pykan\kan\spline.py:135, in curve2coef(x_eval, y_eval, grid, k, device)
    133 # x_eval: (size, batch); y_eval: (size, batch); grid: (size, grid); k: scalar
    134 mat = B_batch(x_eval, grid, k, device=device).permute(0, 2, 1)
--> 135 coef = torch.linalg.lstsq(mat.to('cpu'), y_eval.unsqueeze(dim=2).to('cpu')).solution[:, :, 0]  # sometimes 'cuda' version may diverge
    136 return coef.to(device)

RuntimeError: false INTERNAL ASSERT FAILED at "...\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\BatchLinearAlgebra.cpp":1540, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library.
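
To narrow it down, a quick check like this (my own sketch, not part of the notebook) can show whether the activations that update_grid_from_samples passes to curve2coef already contain NaN:

import torch

model.forward(dataset['train_input'])  # populate model.acts, as KAN.update_grid_from_samples does
for l in range(model.depth):
    print(l, torch.isnan(model.acts[l]).any().item())  # True means this layer's activations contain NaN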

When I set the last training optimizer to Adam, it worked fine.

model.train(dataset, opt="Adam", steps=50);

Does anyone else have the same problem? Or can someone tell me why this happened? Thanks a lot!

AntonioTepsich commented 5 months ago

Retraining after symbolic regression can indeed lead to errors sometimes, particularly because of this part of the code:

mode = "auto" # "manual"

if mode == "manual":
    # manual mode
    model.fix_symbolic(0,0,0,'sin');
    model.fix_symbolic(0,1,0,'x^2');
    model.fix_symbolic(1,0,0,'exp');
elif mode == "auto":
    # automatic mode
    lib = ['x','x^2','x^3','x^4','exp','log','sqrt','tanh','sin','abs']
    model.auto_symbolic(lib=lib)

In your case, the automatic mode can pick log from the library, and the logarithm is not defined for inputs that are less than or equal to zero, which produces NaNs during retraining. That’s why switching from LBFGS to Adam sometimes helps.
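
If you want to keep LBFGS, one possible workaround (just a sketch, assuming the standard hellokan.ipynb setup) is to drop 'log' from the library before calling auto_symbolic, so the automatic mode cannot pick a function that is undefined on part of the input range:

# same library as the notebook, but without 'log'
lib = ['x', 'x^2', 'x^3', 'x^4', 'exp', 'sqrt', 'tanh', 'sin', 'abs']
model.auto_symbolic(lib=lib)
model.train(dataset, opt="LBFGS", steps=50)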

Remember to clear all your outputs after changing the optimizer and retrain from scratch.

e-tuanzi commented 5 months ago

@AntonioTepsich Thank you very much for answering my question.

I also found the reason why a dozen repeated runs always gave the same result.

If you repeat the run in Jupyter Notebook by restarting the kernel each time, you will always get the same result. This probably has to do with how the random variables are initialized on each restart.

Running the notebook again from the top, without restarting the kernel, gives a different result.
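
For example (a sketch assuming the standard hellokan.ipynb setup; the exact dataset function and the seed=0 default are from my memory of the notebook), passing a different seed gives a different initialization even after a kernel restart:

from kan import KAN, create_dataset
import torch

# same target function as hellokan.ipynb: f(x, y) = exp(sin(pi*x) + y^2)
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

# the notebook fixes the seed, so every kernel restart reproduces the same run;
# any other seed gives a different random initialization
model = KAN(width=[2, 5, 1], grid=5, k=3, seed=42)
model.train(dataset, opt="LBFGS", steps=20)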