Closed mw66 closed 5 months ago
What are you working on? Why not call .to(device) yourself?
@yuedajiong What do you mean? I pass the device param to model.train(); it should work out of the box. That file is a tutorial itself.
BTW, where should I put .to(device)? Can you show a working example that trains on GPU?
I'm having a similar issue even when using .to(device):

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fngr = torch.tensor(fingerprints).to(device)
labels = torch.tensor(chnops[np.arange(400, 4002, 2).astype(str)].values).to(device)
my_ds = {"train_input": fngr[:200], "test_input": fngr[200:400],
         "train_label": labels[:200], "test_label": labels[200:400]}
kan_model = KAN(width=[512, 512, 1081], grid=5, k=3, seed=0, device=device)
kan_model.to(device)
kan_model.train(my_ds, opt="LBFGS", steps=50, lamb=0.01, lamb_entropy=10.)
```

This raises:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```
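As a hedged aside: this class of error is not pykan-specific; plain PyTorch raises the same RuntimeError whenever an op mixes a GPU tensor with a CPU tensor. A minimal sketch (names here are illustrative, not from the project):

```python
import torch

# Pick a GPU if one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(2, 3, device=device)
b = torch.randn(3, 4)  # left on CPU on purpose

try:
    a @ b  # on a CUDA machine this is the mixed-device RuntimeError
except RuntimeError as e:
    print("mixed-device error:", e)

# Fix: move every tensor involved onto the same device.
out = a @ b.to(device)
print(out.device)
```

On a CPU-only machine both tensors land on CPU and no error occurs, which is exactly why device bugs often go unnoticed until someone runs on GPU.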
The tutorial uses KAN(device=device), but this does not put the tensors on the specified device (a bug). The commonly used .to(device) API does, though; use that.
For example:
```python
>>> from kan import *
>>> device = 'mps'

>>> [(n, p.device) for n, p in KAN(width=[1,1], device=device).named_parameters()]
[('biases.0.weight', device(type='cpu')),
 ('act_fun.0.coef', device(type='cpu')),
 ('act_fun.0.scale_base', device(type='cpu')),
 ('act_fun.0.scale_sp', device(type='cpu')),
 ('act_fun.0.mask', device(type='cpu')),
 ('symbolic_fun.0.mask', device(type='cpu')),
 ('symbolic_fun.0.affine', device(type='cpu'))]

>>> [(n, p.device) for n, p in KAN(width=[1,1]).to(device).named_parameters()]
[('biases.0.weight', device(type='mps', index=0)),
 ('act_fun.0.grid', device(type='mps', index=0)),
 ('act_fun.0.coef', device(type='mps', index=0)),
 ('act_fun.0.scale_base', device(type='mps', index=0)),
 ('act_fun.0.scale_sp', device(type='mps', index=0)),
 ('act_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.affine', device(type='mps', index=0))]

>>> [(n, p.device) for n, p in KANLayer(device=device).named_parameters()]
[('coef', device(type='cpu')),
 ('scale_base', device(type='cpu')),
 ('scale_sp', device(type='cpu')),
 ('mask', device(type='cpu'))]

>>> [(n, p.device) for n, p in KANLayer().to(device).named_parameters()]
[('grid', device(type='mps', index=0)),
 ('coef', device(type='mps', index=0)),
 ('scale_base', device(type='mps', index=0)),
 ('scale_sp', device(type='mps', index=0)),
 ('mask', device(type='mps', index=0))]
```
There are still other problems with training on mps, but at least this lands on the right device.
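For portability between the CUDA and mps setups discussed in this thread, a small device-selection sketch in plain PyTorch (an assumption about the reader's setup, not pykan code): prefer CUDA, fall back to Apple MPS, else CPU.

```python
import torch

# Pick the best available backend: CUDA, then Apple MPS, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(device)
```

Both the model (via .to(device)) and every input tensor then need to target this one device.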
I've copied what you've done:

```python
kan_model = KAN(width=[512, 512, 1081], grid=5, k=3, seed=0, device=device)
kan_model.to(device)
[(n, p.device) for n, p in kan_model.to(device).named_parameters()]
```

Both my X and y datasets are CUDA tensors, and I'm still getting the same error.
I can't compare with CUDA right now (no such device here), but for me .to(device) lands on mps:
```python
>>> from kan import *
>>> device = 'mps'
>>> model = KAN(width=[1,1])
>>> model.to(device)
>>> [(n, p.device) for n, p in model.named_parameters()]
[('biases.0.weight', device(type='mps', index=0)),
 ('act_fun.0.grid', device(type='mps', index=0)),
 ('act_fun.0.coef', device(type='mps', index=0)),
 ('act_fun.0.scale_base', device(type='mps', index=0)),
 ('act_fun.0.scale_sp', device(type='mps', index=0)),
 ('act_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.affine', device(type='mps', index=0))]
```
Okay, I have a Mac; I'll try on mps. Are you working from source on the repo as well?
Yeah, I did pip install -e . on 0c79f78.
> I can't compare with CUDA right now, no device here, but for me .to(device) lands on mps
Is it possible to run mps on Linux?
The fix is easy:
- call .to(device) with the same, unique device on the initialized network (its parameters) and on the inputs
- if it still errors, check the log and make sure ALL dynamically created variables are also moved with .to(device) (this is the important part)
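The second point above is the subtle one. A minimal sketch of the pattern, assuming a hypothetical module (not pykan's actual code): tensors created inside forward() default to CPU unless they are allocated on the input's device, e.g. via torch.*_like or device=x.device.

```python
import torch
import torch.nn as nn

class Layer(nn.Module):
    """Hypothetical layer illustrating the dynamic-tensor device bug."""

    def __init__(self, n):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n, n))

    def forward(self, x):
        # Buggy version would be: mask = torch.ones(x.shape)  -> always CPU,
        # which breaks when x is on CUDA/MPS.
        mask = torch.ones_like(x)                      # inherits x's device
        bias = torch.zeros(x.shape[-1], device=x.device)  # explicit device
        return (x * mask) @ self.weight + bias

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layer = Layer(4).to(device)           # moves registered parameters
y = layer(torch.randn(2, 4, device=device))
print(y.device)
```

Parameters and buffers registered on the module are moved by .to(device) automatically; only tensors materialized on the fly need this explicit treatment.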
I think this is what Rhys has already done; see the comments above:
https://github.com/KindXiaoming/pykan/issues/52#issuecomment-2094769633
@yuedajiong If you can make it work on cuda, can you submit a PR?
@mw66 Please zip all the code in your project. I have a GPU, and I think I can debug and fix it, then send it back to you. If you directly tried API_10_device.ipynb, please confirm; I will debug that program.
Update: 1) So easy, fixed in just 3 minutes; done. Please check my zip and search for all occurrences of 'John'. 2) The author is a scientist who focuses on the algorithm; I assume he typically experiments on CPU, so you may need to modify the code yourself if you want it to be of very high quality.
You can submit a PR and close this issue now.
@yuedajiong Thanks for the quick fix.
But I think someone did the PR already:
https://github.com/KindXiaoming/pykan/pull/83
and I have tested it, it's working.
I also tried your kan.zip: while it did train on GPU, the result is not good. With the original code (of API_10_device.ipynb), both the CPU version and the https://github.com/KindXiaoming/pykan/pull/83 CUDA version reach a final train loss around x.xxe-03, e.g.:

```
train loss: 6.48e-03 | test loss: 6.54e-03 | reg: 7.25e+00 : 100%|██| 50/50 [00:22<00:00, 2.20it/s]
```

While with your version of the fix, the final training loss is around x.xxe-01, e.g.:

```
train loss: 5.81e-01 | test loss: 5.82e-01 | reg: 1.41e+01 : 100%|██| 50/50 [00:17<00:00, 2.81it/s]
```

I have run each of the above scenarios 3 times; the results are the same.
Hi,
I tried this: https://github.com/KindXiaoming/pykan/blob/master/tutorials/API_10_device.ipynb

But nvidia-smi shows no GPU usage at all, and top shows high CPU usage; it looks like it's not training on GPU. Then I added device=device, and it errors out: