Blealtan / efficient-kan

An efficient pure-PyTorch implementation of Kolmogorov-Arnold Network (KAN).
MIT License

Intel oneMKL ERROR: Parameter 6 was incorrect on entry to SGELSY. #20

Open wza13 opened 1 month ago

wza13 commented 1 month ago

D:\Users\12719\anaconda3\python.exe D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py
 20%|██ | 20/100 [00:01<00:06, 12.66it/s, mse_loss=nan, reg_loss=nan]
Intel oneMKL ERROR: Parameter 6 was incorrect on entry to SGELSY.
Intel oneMKL ERROR: Parameter 6 was incorrect on entry to SGELSY.
 20%|██ | 20/100 [00:02<00:08, 9.82it/s, mse_loss=nan, reg_loss=nan]
Traceback (most recent call last):
  File "D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py", line 35, in <module>
    test_mul()
  File "D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py", line 29, in test_mul
    optimizer.step(closure)
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\optim\optimizer.py", line 459, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\optim\lbfgs.py", line 320, in step
    orig_loss = closure()
                ^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\tests\test_simple_math.py", line 18, in closure
    y = kan(x, update_grid=(i % 20 == 0))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\src\efficient_kan\kan.py", line 272, in forward
    layer.update_grid(x)
  File "D:\Users\12719\anaconda3\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\src\efficient_kan\kan.py", line 210, in update_grid
    self.spline_weight.data.copy_(self.curve2coeff(x, unreduced_spline_output))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\12719\PycharmProjects\efficient-kan\src\efficient_kan\kan.py", line 131, in curve2coeff
    solution = torch.linalg.lstsq(
               ^^^^^^^^^^^^^^^^^^^
RuntimeError: false INTERNAL ASSERT FAILED at "C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\BatchLinearAlgebra.cpp":1538, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library.

LIWEIDENG0830 commented 1 month ago

Hi bro, did you solve this problem? I get the same output when running test_simple_math.py.

Indoxer commented 1 month ago

Sounds like https://github.com/KindXiaoming/pykan/issues/170. Changing the driver in the code may help.

LIWEIDENG0830 commented 1 month ago

Hi Indoxer, thanks for your kind help! It looks like the same problem as in pykan. However, I tried changing the driver in lstsq to solution = torch.linalg.lstsq(A, B, driver='gelsy').solution and running on CPU. It does not work in my situation.
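
For readers trying the driver change: per the PyTorch documentation, 'gelsy' is already the default driver for CPU inputs, so passing driver='gelsy' on CPU selects the same LAPACK routine (SGELSY) named in the error. A minimal sketch of trying the other CPU drivers follows. The helper name lstsq_with_driver and the tensor sizes are hypothetical and are not taken from efficient-kan.

```python
import torch

def lstsq_with_driver(A, B, driver="gelsd"):
    """Solve min ||A X - B|| on CPU with an explicitly chosen LAPACK driver.

    Hypothetical helper for illustration; efficient-kan calls
    torch.linalg.lstsq directly inside curve2coeff.
    """
    return torch.linalg.lstsq(A, B, driver=driver).solution

torch.manual_seed(0)
A = torch.randn(2, 16, 8)      # (batch, samples, coefficients), assumed sizes
X_true = torch.randn(2, 8, 3)  # (batch, coefficients, outputs), assumed sizes
B = A @ X_true                 # consistent system, so the exact solution is X_true

# 'gelsy' is the CPU default; 'gelsd', 'gelss' and 'gels' are the other CPU options.
for driver in ("gelsy", "gelsd", "gelss", "gels"):
    X = lstsq_with_driver(A, B, driver)
    print(driver, torch.allclose(X, X_true, atol=1e-3))
```

Switching drivers only changes which routine receives the matrices. If A or B already contains NaN, a different driver is unlikely to help.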

Xu-backup commented 1 month ago

This is because the learning rate is too high (lr = 1) in that example, so B turns to NaN during training. Lowering it may fix the problem.
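
A minimal sketch of the suggested change, assuming the test builds its optimizer with torch.optim.LBFGS (the traceback goes through torch\optim\lbfgs.py). The small MLP below is only a stand-in for the KAN model, and lr=0.1 is an assumed value, not one recommended anywhere in this thread.

```python
import math
import torch

torch.manual_seed(0)

# Stand-in model; in the real test this would be the KAN instance.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 8), torch.nn.SiLU(), torch.nn.Linear(8, 1)
)

# Lower learning rate than lr=1. The value 0.1 is an assumption for illustration.
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)

x = torch.rand(256, 2)
target = x[:, :1] * x[:, 1:]  # multiplication task, as in test_mul

def closure():
    # LBFGS re-evaluates the loss several times per step, hence the closure.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    return loss

initial_loss = closure().item()
for _ in range(20):
    optimizer.step(closure)

with torch.no_grad():
    final_loss = torch.nn.functional.mse_loss(model(x), target).item()

print(initial_loss, final_loss, math.isfinite(final_loss))
```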

LIWEIDENG0830 commented 1 month ago

Okkkk. Thanks Xu. It works!

boxaio commented 1 month ago

The above error happened when updating the grid, so how is it related to the explosion of B?

Xu-backup commented 1 month ago

I have not actually found out why it happens. But I found B = y.transpose(0, 1) in the code, and y turns NaN first, so maybe somewhere a value is divided by a number close to 0. With a high lr you can easily get abnormal parameters.
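
The progress bar in the original report already shows mse_loss=nan at step 20, the same step where update_grid runs, which fits this explanation: NaN parameters give a NaN y, so the B passed to lstsq is NaN too. One way to surface this earlier is to check the inputs before the solve. The sketch below uses a hypothetical wrapper named checked_lstsq, which is not part of efficient-kan, and the choice of FloatingPointError is also an assumption.

```python
import torch

def checked_lstsq(A, B):
    """Run torch.linalg.lstsq only if both inputs are finite.

    Hypothetical guard for illustration. It raises a readable error
    instead of handing NaN or Inf values to the LAPACK backend.
    """
    for name, tensor in (("A", A), ("B", B)):
        if not torch.isfinite(tensor).all():
            raise FloatingPointError(
                f"{name} contains NaN or Inf; check the learning rate and the loss"
            )
    return torch.linalg.lstsq(A, B).solution

torch.manual_seed(0)
A = torch.randn(4, 16, 8)  # assumed sizes, not taken from kan.py
B = torch.randn(4, 16, 3)

solution = checked_lstsq(A, B)  # finite inputs: solved normally
print(solution.shape)

B_bad = B.clone()
B_bad[0, 0, 0] = float("nan")   # simulate y having turned NaN
try:
    checked_lstsq(A, B_bad)
except FloatingPointError as err:
    print("caught:", err)
```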