Pytorch bug - Githubissues

Enqiliu125 commented 5 months ago

Dear Everyone,

I am writing to report a bug I encountered while running symbolic regression with KAN. The issue appeared after I changed the network to 1D input, 1D output, and 5 hidden neurons. During training, I got the following error message: false INTERNAL ASSERT FAILED at "..\aten\src\ATen\native\BatchLinearAlgebra.cpp":1538, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library. It seems there might be an issue with how PyTorch calls the backend library.

from kan import *
# create a KAN: 1D inputs, 1D output, and 5 hidden neurons. cubic spline (k=3), 5 grid intervals (grid=5).
model = KAN(width=[1, 5, 1], grid=5, k=3, seed=0, device='cpu')
A = 3.55*10**15
n = -0.41
E = 16.6  #  J/mol
R = 8.314  #  J/(mol*K)
# create dataset f(x) = A*T^n*exp(-E/(R*T))
f = lambda x: torch.exp(-E/R/x)*A*x**n
dataset = create_dataset(f, n_var=1)
dataset['train_input'].shape, dataset['train_label'].shape
# train the model
model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10.);
model = model.prune()

KindXiaoming commented 5 months ago

Hi, there could be multiple possibilities:

(1) Your dataset is ill-conditioned (e.g., one variable is constant throughout the dataset)
(2) The matrix mat is nearly singular, for whatever reason (maybe due to (1), or maybe lamb or lamb_entropy are too large)

At which training step do you see this error: step 0 or later? If it is step 0, (1) is more likely. If it appears after a while, (2) is more likely. You may also try changing the driver argument of torch.linalg.lstsq, but I don't have a systematic suggestion. You may try all of them. :->
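
For anyone who wants to try the driver suggestion, here is a minimal sketch using only torch. The helper name try_lstsq_drivers is hypothetical and is not part of pykan or PyTorch. It assumes the CPU driver names accepted by torch.linalg.lstsq ('gelsy', 'gelsd', 'gelss', 'gels'), and it first checks the inputs for NaN/inf, since non-finite values are one plausible way to reach the LAPACK argument error.

```python
import torch

def try_lstsq_drivers(mat, rhs, drivers=("gelsy", "gelsd", "gelss", "gels")):
    """Hypothetical helper: solve mat @ x = rhs, trying each CPU driver in turn."""
    # Non-finite inputs cannot be solved meaningfully by any driver.
    if not (torch.isfinite(mat).all() and torch.isfinite(rhs).all()):
        raise ValueError("non-finite values in inputs; fix the data before solving")
    errors = {}
    for driver in drivers:
        try:
            sol = torch.linalg.lstsq(mat, rhs, driver=driver).solution
        except RuntimeError as exc:
            errors[driver] = str(exc)
            continue
        if torch.isfinite(sol).all():
            return driver, sol
        errors[driver] = "non-finite solution"
    raise RuntimeError(f"all drivers failed: {errors}")

# Well-conditioned toy problem (batch of 1, 20 equations, 4 unknowns).
torch.manual_seed(0)
mat = torch.randn(1, 20, 4)
true_coef = torch.randn(1, 4, 1)
rhs = mat @ true_coef
driver, coef = try_lstsq_drivers(mat, rhs)
print(driver, torch.allclose(coef, true_coef, atol=1e-3))
```

To apply this to the failing call, one would have to pass the same mat and y_eval tensors that curve2coef builds. That wiring is left out here because it depends on the installed pykan version.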

Enqiliu125 commented 5 months ago

I changed the network to 1D input, 1D output, and 5 hidden neurons. Even though the lamb value is not large, I still hit this issue. My goal is to fit a single-variable formula of the Arrhenius type: y = A*x^n*exp(-E/(R*x)). The problem appears when the input is a single variable. I have also written my code following the tutorial. Could you please help me understand the reason behind this issue and how to resolve it?

My code is as follows:

from kan import *
import numpy as np
# create a KAN: 1D inputs, 1D output, and 5 hidden neurons. cubic spline (k=3), 5 grid intervals (grid=5).
model = KAN(width=[1, 5, 1], grid=5, k=3, seed=0, device='cpu')
A = 3.55*10**15
n = -0.41
E = 16.6  #  J/mol
R = 8.314  #  J/(mol*K)
# create dataset f(x) = A*T^n*exp(-E/(R*T))
x = np.linspace(250, 1250, 1000)
f = lambda x: torch.exp(-E/R/x)*A*x**n
dataset = create_dataset(f, n_var=1)
dataset['train_input'].shape, dataset['train_label'].shape
# train the model
model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10.);
model = model.prune()
model(dataset['train_input'])
model.plot()

Here is the error:

runfile('C:/Users/26060/Desktop/kan/kan_try.py', wdir='C:/Users/26060/Desktop/kan')
train loss: nan | test loss: nan | reg: nan :  25%|████▌             | 5/20 [00:03<00:11,  1.32it/s]
Traceback (most recent call last):

  File D:\miniconda\Lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\26060\desktop\kan\kan_try.py:20
    model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10.);

  File D:\miniconda\Lib\site-packages\kan\KAN.py:898 in train
    self.update_grid_from_samples(dataset['train_input'][train_id].to(device))

  File D:\miniconda\Lib\site-packages\kan\KAN.py:244 in update_grid_from_samples
    self.act_fun[l].update_grid_from_samples(self.acts[l])

  File D:\miniconda\Lib\site-packages\kan\KANLayer.py:218 in update_grid_from_samples
    self.coef.data = curve2coef(x_pos, y_eval, self.grid, self.k, device=self.device)

  File D:\miniconda\Lib\site-packages\kan\spline.py:137 in curve2coef
    coef = torch.linalg.lstsq(mat.to('cpu'), y_eval.unsqueeze(dim=2).to('cpu')).solution[:, :, 0]  # sometimes 'cuda' version may diverge

RuntimeError: false INTERNAL ASSERT FAILED at "..\\aten\\src\\ATen\\native\\BatchLinearAlgebra.cpp":1538, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.

Intel MKL ERROR: Parameter 6 was incorrect on entry to SGELSY.
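
The progress bar above already shows "train loss: nan" before the crash, so one thing worth checking is whether the labels themselves are finite. The sketch below uses only torch and rests on an assumption that is not confirmed anywhere in this thread: that create_dataset samples inputs from a default range of [-1, 1] unless told otherwise, which would mean the x array built with np.linspace is never used. Under that assumption, a negative x raised to the power -0.41 gives NaN in torch.

```python
import torch

A = 3.55e15
n = -0.41
E = 16.6   # J/mol
R = 8.314  # J/(mol*K)
f = lambda x: torch.exp(-E / R / x) * A * x**n

torch.manual_seed(0)
# Assumed default sampling range [-1, 1] (an assumption about create_dataset).
x_default = torch.rand(1000, 1) * 2 - 1
# Temperature range the script apparently intended: [250, 1250].
x_physical = torch.rand(1000, 1) * 1000 + 250

y_default = f(x_default)
y_physical = f(x_physical)

# Negative bases with a fractional exponent give NaN, so this prints False.
print(torch.isfinite(y_default).all().item())
# On the positive temperature range every label is finite, so this prints True.
print(torch.isfinite(y_physical).all().item())

# The finite labels are still of order 1e14, so rescaling them to order 1
# before training may be worth trying (again a suggestion, not a confirmed fix).
y_scaled = y_physical / y_physical.abs().max()
print(y_scaled.min().item(), y_scaled.max().item())
```

If the installed pykan version exposes a ranges argument on create_dataset, passing the temperature interval there would be the natural way to test this. Whether that resolves the error was never confirmed in this thread.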

PerfertVan commented 3 months ago

Hi, have you solved this problem?

Enqiliu125 commented 3 months ago

Yes, I have worked it out!

thanubharadwaj commented 3 months ago

Hi, I am facing the same issue. Could you please explain how you solved the error?

KindXiaoming / pykan

Pytorch bug #242