Closed: SimoSbara closed this issue 5 months ago
This should be fixed with https://github.com/KindXiaoming/pykan/pull/98 (please try)
An update: the RuntimeError: expected scalar type Double but found Float was my fault, because the normalized images had been saved with dtype=float64.
As for the device choice for training, CUDA now works fine without model.to(device), but forcing the CPU gives an error:
Traceback (most recent call last):
File "/root/Progetti_GIT/OCR/kanocr.py", line 86, in <module>
results = model.train(dataset, opt="Adam", steps=3, save_fig_freq=0, batch=16, device=device)# metrics=(train_acc, test_acc))
File "/root/miniconda3/envs/tf/lib/python3.9/site-packages/kan/KAN.py", line 898, in train
self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
File "/root/miniconda3/envs/tf/lib/python3.9/site-packages/kan/KAN.py", line 243, in update_grid_from_samples
self.forward(x)
File "/root/miniconda3/envs/tf/lib/python3.9/site-packages/kan/KAN.py", line 311, in forward
x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
File "/root/miniconda3/envs/tf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/tf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/tf/lib/python3.9/site-packages/kan/KANLayer.py", line 176, in forward
y = self.scale_base.unsqueeze(dim=0) * base + self.scale_sp.unsqueeze(dim=0) * y
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
The problem is at line 126 of KANLayer.py:
if isinstance(scale_base, float):
self.scale_base = torch.nn.Parameter(torch.ones(size, device=device) * scale_base).requires_grad_(sb_trainable) # make scale trainable
else:
self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable)
It forces .cuda() even when you are running on the CPU. A temporary solution could be this:
if isinstance(scale_base, float):
self.scale_base = torch.nn.Parameter(torch.ones(size, device=device) * scale_base).requires_grad_(sb_trainable) # make scale trainable
else:
self.scale_base = torch.nn.Parameter(torch.tensor(scale_base, device=device)).requires_grad_(sb_trainable)
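As a hedged sketch of the device-agnostic construction (make_scale_base is a hypothetical helper for illustration, not pykan API), building the tensor with device=device keeps the parameter on whatever device the caller passed in:

```python
import torch

def make_scale_base(scale_base, size, device, sb_trainable=True):
    # Hypothetical helper mirroring the fix above: build the parameter on
    # the caller-supplied device instead of hard-coding .cuda().
    if isinstance(scale_base, float):
        t = torch.ones(size, device=device) * scale_base
    else:
        t = torch.tensor(scale_base, device=device,
                         dtype=torch.get_default_dtype())
    return torch.nn.Parameter(t).requires_grad_(sb_trainable)

p = make_scale_base(1.0, 4, device="cpu")
print(p.device)  # cpu
```

The same call then works unchanged with device="cuda" when a GPU is present.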
Could you please take the time to review https://github.com/KindXiaoming/pykan/pull/98? It does exactly what you mentioned and more. It would probably have saved you some migraines!
Wooops, sorry XD. Yes, it does work!
Side note: the CPU somehow outperforms CUDA here; maybe that's because I'm using a small dataset?
CPU:
train loss: 1.85e-01 | test loss: 1.90e+04 | reg: 1.41e+05 : 33%|█▎ | 1/3 [00:18<00:36, 18.19s/it]
CUDA:
train loss: 1.83e-01 | test loss: 2.18e+04 | reg: 1.42e+05 : 33%|█▎ | 1/3 [00:44<01:28, 44.34s/it]
Other than this, it works just fine.
Yeah, we already noticed the slowness of CUDA on small datasets. I haven't had time to test it on big datasets yet, but I think it's actually due to a part of the code where data is forcibly moved to the CPU to perform the lstsq solve. I'll open another issue/PR as soon as I have more info and time to dedicate to it :)
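For reference, torch.linalg.lstsq itself runs on whichever device its inputs live on, so in principle the CPU round-trip should be avoidable (a sketch under that assumption, not the pykan code path):

```python
import torch

# Pick the GPU when available; the least-squares solve below then stays
# on that device with no transfer back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(100, 5, device=device, dtype=torch.float64)
x_true = torch.randn(5, 1, device=device, dtype=torch.float64)
b = A @ x_true  # exact right-hand side, so the solve recovers x_true
x = torch.linalg.lstsq(A, b).solution
print(x.device.type)  # same device as A and b
```

Whether this is actually faster than the CPU path for pykan-sized problems would still need benchmarking.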
Hello, I'm trying to train a KAN network for OCR. In the process I had to make a few tweaks to be able to use the GPU.
I set torch.set_default_dtype(torch.float64) inside __init__.py,
otherwise I get RuntimeError: expected scalar type Double but found Float
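That dtype clash can be reproduced in isolation; this sketch uses a plain torch.nn.Linear rather than pykan:

```python
import torch

# Layers created under the default dtype have float32 weights, so feeding
# them float64 (Double) data raises the error quoted above.
layer = torch.nn.Linear(3, 2)
x64 = torch.randn(1, 3, dtype=torch.float64)
try:
    layer(x64)
except RuntimeError as e:
    print(type(e).__name__)  # RuntimeError

# Either cast the data down to match the weights...
y = layer(x64.float())

# ...or make float64 the default before building the model, as done here.
torch.set_default_dtype(torch.float64)
layer64 = torch.nn.Linear(3, 2)
y64 = layer64(x64)
print(y64.dtype)  # torch.float64
```

Note that set_default_dtype is global state, so it has to run before the model is constructed.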
also this:
and all data
this is the entire script:
After that the GPU kicks in, even though it is still very slow...