KindXiaoming / pykan

Kolmogorov Arnold Networks
MIT License

M1 runtime fails with "AssertionError: Torch not compiled with CUDA enabled" #107

Open rmrfxyz opened 2 months ago

rmrfxyz commented 2 months ago

Hi! Thanks a lot for the awesome paper and implementation!

I can't get it to run on my M1 machine. I built PyTorch from source with the CUDA options disabled, as per https://github.com/IAMAl/PyTorch4M1. I tried setting device = "cpu" and poked around randomly, but I always get the same error when trying to run the examples:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 6
      2 import torch
      4 # device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
----> 6 model = KAN(width=[2,3,2,1], device='cpu')
      7 model.to(model.device)
      8 x = torch.normal(0,1,size=(100,2))

File ~/conda/envs/pykan-env/lib/python3.9/site-packages/kan/KAN.py:140, in KAN.__init__(self, width, grid, k, noise_scale, noise_scale_base, base_fun, symbolic_enabled, bias_trainable, grid_eps, grid_range, sp_trainable, sb_trainable, device, seed)
    137 for l in range(self.depth):
    138     # splines
    139     scale_base = 1 / np.sqrt(width[l]) + (torch.randn(width[l] * width[l + 1], ) * 2 - 1) * noise_scale_base
--> 140     sp_batch = KANLayer(in_dim=width[l], out_dim=width[l + 1], num=grid, k=k, noise_scale=noise_scale, scale_base=scale_base, scale_sp=1., base_fun=base_fun, grid_eps=grid_eps, grid_range=grid_range, sp_trainable=sp_trainable,
    141                         sb_trainable=sb_trainable, device=device)
    142     self.act_fun.append(sp_batch)
    144     # bias

File ~/conda/envs/pykan-env/lib/python3.9/site-packages/kan/KANLayer.py:126, in KANLayer.__init__(self, in_dim, out_dim, num, k, noise_scale, scale_base, scale_sp, base_fun, grid_eps, grid_range, sp_trainable, sb_trainable, device)
    124     self.scale_base = torch.nn.Parameter(torch.ones(size, device=device) * scale_base).requires_grad_(sb_trainable)  # make scale trainable
    125 else:
--> 126     self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable)
    127 self.scale_sp = torch.nn.Parameter(torch.ones(size, device=device) * scale_sp).requires_grad_(sp_trainable)  # make scale trainable
    128 self.base_fun = base_fun
...
    286     raise AssertionError(
    287         "libcudart functions unavailable. It looks like you have a broken build?"
    288     )

AssertionError: Torch not compiled with CUDA enabled

What am I missing 🤔

AlessandroFlati commented 1 month ago

You should put torch==2.3.0+cu121 (or whichever CUDA version you need) into the requirements.

AlessandroFlati commented 1 month ago

Actually, the latest master version is broken without https://github.com/KindXiaoming/pykan/pull/98. @KindXiaoming

rmrfxyz commented 1 month ago

But pytorch is built locally and not installed through requirements.txt, as that fails on M1 since there is no CUDA available. So I built it from source and installed it in conda env separately.

I find it confusing that the error says "torch NOT compiled with CUDA", since I have to explicitly disable those options before building - otherwise it fails to install.

So I'm thinking maybe the failure is in the pytorch build, not in pykan... Maybe? I'll try to fiddle with the makefile, maybe I'm overlooking something there.

gonzalalGFM commented 1 month ago

Hi! Yesterday I was able to run it on an M1 Max chip with the following versions (in an Anaconda environment):

Name         Version  Build   Channel
torch        2.3.0    pypi_0  pypi
torchaudio   2.3.0    pypi_0  pypi
torchvision  0.18.0   pypi_0  pypi

It is extremely slow compared with the CPU version on Windows. Idk if it makes any difference, but I do not move the model to a device through torch; I just use these lines:

kan_model = KAN(width=[2, 1, grid_size * grid_size], grid=2, k=3, seed=0)
kan_model.train(my_ds, opt="LBFGS", steps=2, lamb=0.01, lamb_entropy=10.)

AlessandroFlati commented 1 month ago

As you can see, the problem is in the line self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable) which, in a previous (bad) attempt to let people use CUDA, forces the parameter onto CUDA. You can edit that line yourself if you just want to use the CPU, but we should really just wait for the PR to be accepted.
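For readers who want to patch the line locally, here is a minimal sketch of a device-agnostic alternative. The helper name make_scale_base is hypothetical and is not part of pykan; it only illustrates building the parameter on whatever device was requested instead of calling .cuda() unconditionally. The actual fix in the PR may look different.

```python
import torch

def make_scale_base(scale_base, device="cpu", trainable=True):
    """Hypothetical helper (not pykan API): create the parameter on `device`
    rather than forcing it onto CUDA."""
    # Convert to float32 without assuming any particular device.
    values = torch.as_tensor(scale_base, dtype=torch.float32)
    # .to(device) is a no-op when the tensor is already on that device.
    param = torch.nn.Parameter(values.to(device))
    return param.requires_grad_(trainable)

# Works on a build without CUDA because no .cuda() call is made.
noise = torch.randn(6) * 0.1
p = make_scale_base(noise, device="cpu", trainable=True)
print(p.device, p.dtype, p.requires_grad)
```

On a CPU-only build, any unconditional .cuda() call raises "Torch not compiled with CUDA enabled" regardless of the device argument passed to KAN, which is why routing everything through the requested device avoids the error.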

gonzalalGFM commented 1 month ago

Also, I'm unable to run any KAN model on the GPU. I send both the dataset and the model to the device (cuda), but it keeps giving me an error:

device = torch.device("cuda")
dataset = {}
dataset["train_input"] = torch.from_numpy(np.array(X_train))
dataset["test_input"] = torch.from_numpy(np.array(X_test))
dataset["train_label"] = torch.from_numpy(np.array(Y_train))
dataset["test_label"] = torch.from_numpy(np.array(Y_test))
for key, value in dataset.items():
    dataset[key] = dataset[key].to(device)
kan_model = KAN(width=[2, 1, grid_size * grid_size], grid=3, k=3, seed=0, device=device)
kan_model.train(dataset, opt="LBFGS", steps=50, lamb=0, lamb_entropy=0, device=device)

Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
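When debugging this kind of mismatch, it can help to check which tensors actually live on which device. The sketch below uses a plain torch.nn.Linear as a stand-in model, and the helper names dataset_to and find_device_mismatches are hypothetical. It only inspects registered parameters and buffers, so tensors stored as plain attributes would not be found by this check.

```python
import torch

def dataset_to(dataset, device):
    """Return a copy of the dataset dict with every tensor moved to `device`."""
    return {key: value.to(device) for key, value in dataset.items()}

def find_device_mismatches(module, device):
    """List registered parameters/buffers that are not on the expected device type."""
    expected = torch.device(device).type
    named = list(module.named_parameters()) + list(module.named_buffers())
    return [name for name, tensor in named if tensor.device.type != expected]

# Fall back to the CPU when CUDA is not available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = {
    "train_input": torch.randn(8, 2),
    "train_label": torch.randn(8, 1),
}
dataset = dataset_to(dataset, device)

# Stand-in model; an empty list means every registered tensor is on `device`.
model = torch.nn.Linear(2, 1).to(device)
print(find_device_mismatches(model, device))
```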

rmrfxyz commented 1 month ago

@AlessandroFlati I tried changing that to mps, but it didn't work (didn't expect it to...). Idk, this is pretty far out of my comfort zone, tbh. Alright, glad to hear a PR is in the pipeline, I'll wait for that. Thanks!

@gonzalalGFM Cheers! Maybe I'll give it a try until the PR gets merged.

AlessandroFlati commented 1 month ago

You should actually change it to cpu, not to mps.
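A small sketch of choosing the device defensively, so the same script runs on CUDA machines and on CPU-only builds such as the M1 setup in this thread. The function name pick_device is hypothetical. It deliberately falls back to the CPU rather than MPS, since MPS was reported above not to work here.

```python
import torch

def pick_device(prefer_cuda=True):
    """Hypothetical helper: use CUDA only when this PyTorch build supports it."""
    if prefer_cuda and torch.cuda.is_available():
        return torch.device("cuda")
    # CPU-only builds (for example on Apple silicon) end up here.
    return torch.device("cpu")

device = pick_device()
x = torch.normal(0, 1, size=(100, 2)).to(device)
print(device, x.device)
```

The resulting device can then be passed wherever a device argument is accepted, instead of hard-coding 'cuda'.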

AlessandroFlati commented 1 month ago

@gonzalalGFM The "Expected all tensors to be on the same device" error you reported when running on CUDA is also fixed by the PR.

Justin-12138 commented 1 month ago

@AlessandroFlati I tried to edit the self.scale_base line you pointed out, using the code from your fork (device = torch.device('cpu')), but I still get AssertionError: Torch not compiled with CUDA enabled

AlessandroFlati commented 1 month ago

That's strange. Could you please create a gist/snippet that reproduces your case, so I can try it and further expand the PR if needed? That would be very much appreciated!

Justin-12138 commented 1 month ago

Sorry, my fault. I had just copied the code in /kan from your fork, thinking you had already edited it. After editing those lines myself it works, but I get a new error when I run the code below:

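import torch
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from kan import KAN

dataset = {}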
train_input, train_label = make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=None)
test_input, test_label = make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=None)

dataset['train_input'] = torch.from_numpy(train_input)
dataset['test_input'] = torch.from_numpy(test_input)
dataset['train_label'] = torch.from_numpy(train_label[:, None])
dataset['test_label'] = torch.from_numpy(test_label[:, None])
device = torch.device('cpu')
X = dataset['train_input']
y = dataset['train_label']

plt.scatter(X[:, 0], X[:, 1], c=y[:, 0])

model = KAN(width=[2, 1], grid=3, k=3, device=device)

def train_acc():
    return torch.mean((torch.round(model(dataset['train_input'])[:, 0]) == dataset['train_label'][:, 0]).float())

def test_acc():
    return torch.mean((torch.round(model(dataset['test_input'])[:, 0]) == dataset['test_label'][:, 0]).float())

results = model.train(dataset, opt="LBFGS", steps=20, metrics=(train_acc, test_acc))
print(results['train_acc'][-1], results['test_acc'][-1])

I got this error:

description:   0%|                                                           | 0/20 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\JUSTIN200\Desktop\pykan\example\test.py", line 32, in <module>
    results = model.train(dataset, opt="LBFGS", steps=20, metrics=(train_acc, test_acc))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KAN.py", line 899, in train
    self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KAN.py", line 244, in update_grid_from_samples
    self.forward(x)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KAN.py", line 312, in forward
    x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KANLayer.py", line 175, in forward
    y = coef2curve(x_eval=x, grid=self.grid[self.weight_sharing], coef=self.coef[self.weight_sharing], k=self.k, device=self.device)  # shape (size, batch)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\spline.py", line 100, in coef2curve
    y_eval = torch.einsum('ij,ijk->ik', coef, B_batch(x_eval, grid, k, device=device))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\functional.py", line 380, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: expected scalar type Double but found Float
OS: Windows 11
torch version: 2.2.2

AlessandroFlati commented 1 month ago

As the RuntimeError suggests, I think this is a dtype mismatch: either don't cast to float through .float(), or cast to double instead.

Justin-12138 commented 1 month ago

Thanks! I cast the dataset tensors with .float():

dataset['train_input'] = torch.from_numpy(train_input).float()
dataset['test_input'] = torch.from_numpy(test_input).float()
dataset['train_label'] = torch.from_numpy(train_label[:, None]).float()
dataset['test_label'] = torch.from_numpy(test_label[:, None]).float()

It works for me

train loss: 1.58e-01 | test loss: 1.62e-01 | reg: 1.94e+00 : 100%|██| 20/20 [00:01<00:00, 16.32it/s]
1.0 0.996999979019165
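The reason the cast helps can be seen in a small sketch that does not depend on pykan. NumPy arrays default to float64, torch.from_numpy preserves that dtype, and PyTorch layers default to float32 parameters. Here torch.nn.Linear is a stand-in for the model, and the variable names are illustrative only.

```python
import numpy as np
import torch

# NumPy produces float64 by default, and from_numpy keeps that dtype.
x_np = np.random.rand(4, 2)
x64 = torch.from_numpy(x_np)
print(x64.dtype)  # torch.float64

# Casting to float32 matches the default parameter dtype of torch layers.
x32 = x64.float()
layer = torch.nn.Linear(2, 1)  # parameters are float32 by default
out = layer(x32)
print(out.dtype)  # torch.float32

# Feeding the float64 tensor to float32 parameters raises a RuntimeError.
try:
    layer(x64)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)
```

Casting the inputs once, right after torch.from_numpy, keeps the rest of the training code unchanged.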