KindXiaoming / pykan

Kolmogorov-Arnold Networks
MIT License
14.94k stars · 1.38k forks

model.train(device='cuda') is not working: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #52

Closed · mw66 closed this 5 months ago

mw66 commented 5 months ago

Hi,

I tried this:

https://github.com/KindXiaoming/pykan/blob/master/tutorials/API_10_device.ipynb

But nvidia-smi shows no GPU usage at all, and top shows high CPU usage; it looks like it's not training on the GPU.

Then I added device=device:

model.train(dataset, opt="LBFGS", steps=50, lamb=5e-5, lamb_entropy=2., device=device);

it then errors out:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
goknurarican commented 5 months ago

What are you working on?

yuedajiong commented 5 months ago

Why not call .to(device) yourself?

mw66 commented 5 months ago

@yuedajiong What do you mean? I pass the device param to model.train(), so it should work out of the box. That file is itself a tutorial.

BTW, where do you put .to(device)? Can you show a working example that trains on GPU?

Rhys-McAlister commented 5 months ago

I'm having a similar issue even when using .to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

fngr = torch.tensor(fingerprints).to(device)
labels = torch.tensor(chnops[np.arange(400, 4002, 2).astype(str)].values).to(device)

my_ds = {"train_input": fngr[:200], "test_input": fngr[200:400],
         "train_label": labels[:200], "test_label": labels[200:400]}

kan_model = KAN(width=[512, 512, 1081], grid=5, k=3, seed=0, device=device)
kan_model.to(device)

kan_model.train(my_ds, opt="LBFGS", steps=50, lamb=0.01, lamb_entropy=10.)

which raises:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

leoauri commented 5 months ago

The tutorial uses KAN(device=device), but this does not put the tensors on the specified device (a bug). The commonly used .to(device) API does, though; use that.

For example:

from kan import *
device = 'mps'
[(n, p.device) for n, p in KAN(width=[1,1], device=device).named_parameters()]
[('biases.0.weight', device(type='cpu')),
 ('act_fun.0.coef', device(type='cpu')),
 ('act_fun.0.scale_base', device(type='cpu')),
 ('act_fun.0.scale_sp', device(type='cpu')),
 ('act_fun.0.mask', device(type='cpu')),
 ('symbolic_fun.0.mask', device(type='cpu')),
 ('symbolic_fun.0.affine', device(type='cpu'))]
[(n, p.device) for n, p in KAN(width=[1,1]).to(device).named_parameters()]
[('biases.0.weight', device(type='mps', index=0)),
 ('act_fun.0.grid', device(type='mps', index=0)),
 ('act_fun.0.coef', device(type='mps', index=0)),
 ('act_fun.0.scale_base', device(type='mps', index=0)),
 ('act_fun.0.scale_sp', device(type='mps', index=0)),
 ('act_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.affine', device(type='mps', index=0))]
[(n, p.device) for n, p in KANLayer(device=device).named_parameters()]
[('coef', device(type='cpu')),
 ('scale_base', device(type='cpu')),
 ('scale_sp', device(type='cpu')),
 ('mask', device(type='cpu'))]
[(n, p.device) for n, p in KANLayer().to(device).named_parameters()]
[('grid', device(type='mps', index=0)),
 ('coef', device(type='mps', index=0)),
 ('scale_base', device(type='mps', index=0)),
 ('scale_sp', device(type='mps', index=0)),
 ('mask', device(type='mps', index=0))]

There are still other problems with training on mps, but at least this lands on the right device.
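To summarize the advice above as a runnable sketch: move the model with .to(device) and move every dataset tensor to the same device. This uses a plain torch.nn.Sequential as a stand-in for KAN (the stand-in model, shapes, and dataset here are illustrative, not pykan's actual API), and falls back to CPU so it runs anywhere:

```python
import torch

# Pick the best available device; falls back to CPU so this runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for KAN (any nn.Module follows the same rule): .to(device)
# moves every registered parameter and buffer.
model = torch.nn.Sequential(torch.nn.Linear(2, 5), torch.nn.Linear(5, 1))
model.to(device)

# The dataset tensors must be moved to the same device as the model.
dataset = {
    "train_input": torch.randn(100, 2).to(device),
    "train_label": torch.randn(100, 1).to(device),
}

out = model(dataset["train_input"])
print(out.device)  # same device as the model's parameters
```

The same pattern applies to a KAN model: construct it, call .to(device), and build the dataset dict from tensors already on that device.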

Rhys-McAlister commented 5 months ago

I've copied what you've done:

kan_model = KAN(width=[512, 512, 1081], grid=5, k=3, seed=0, device=device)
kan_model.to(device)

[(n, p.device) for n, p in kan_model.to(device).named_parameters()]

Both my X and y datasets are CUDA tensors, and I'm still getting the same error.
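When the data and parameters both look like they are on CUDA but the error persists, a quick way to locate the stray CPU tensor is to dump the device of every parameter and buffer in the model. A small helper sketch (the helper name report_devices and the demo module are my own, not part of pykan or torch):

```python
import torch

def report_devices(module):
    """Return the device of every parameter and buffer in a module.
    In an 'expected all tensors on the same device' error, any 'cpu'
    entry in an otherwise-cuda model points at the culprit."""
    params = [(n, str(p.device)) for n, p in module.named_parameters()]
    buffers = [(n, str(b.device)) for n, b in module.named_buffers()]
    return params + buffers

# Demo on a small stand-in module (runs without a GPU):
m = torch.nn.Linear(3, 2)
print(report_devices(m))
```

If every entry reports the expected device, the mismatch comes from a tensor created dynamically inside forward(), which this listing cannot see.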

leoauri commented 5 months ago

I can't compare with CUDA right now (no CUDA device here), but for me .to(device) lands on mps:

from kan import *
device = 'mps'
model = KAN(width=[1,1])
model.to(device)
[(n, p.device) for n, p in model.named_parameters()]
[('biases.0.weight', device(type='mps', index=0)),
 ('act_fun.0.grid', device(type='mps', index=0)),
 ('act_fun.0.coef', device(type='mps', index=0)),
 ('act_fun.0.scale_base', device(type='mps', index=0)),
 ('act_fun.0.scale_sp', device(type='mps', index=0)),
 ('act_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.mask', device(type='mps', index=0)),
 ('symbolic_fun.0.affine', device(type='mps', index=0))]
Rhys-McAlister commented 5 months ago

Okay, I have a Mac; I'll try on mps. Are you working from source on the repo as well?

leoauri commented 5 months ago

Yeah, I did pip install -e . on 0c79f78.

mw66 commented 5 months ago

I can't compare with CUDA right now, no device here, but for me .to(device) lands on mps

Is it possible to run mps on Linux?

yuedajiong commented 5 months ago

The fix is easy:

  1. Call .to(device), with a single consistent device, on the initialized network (its parameters) and on the inputs.
  2. If it still errors, check the log and make sure every dynamically created variable is also moved with .to(device).  # IMPORTANT
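The second step is the subtle one: a tensor built inside forward() does not exist yet when .to(device) runs on the module, so it stays on the CPU. A minimal sketch of the failure mode and its fix, using a hypothetical ToyLayer that stands in for a KAN layer (not pykan's actual code):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class ToyLayer(torch.nn.Module):
    """Hypothetical stand-in for a KAN layer: it builds a tensor
    dynamically inside forward(), which module.to(device) cannot
    move because the tensor does not exist yet."""
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        # Step 2 of the fix: create dynamic tensors on x.device,
        # not on the default (CPU) device.
        grid = torch.linspace(-1.0, 1.0, 4, device=x.device)
        return (x + grid) @ self.weight

model = ToyLayer().to(device)           # step 1: move the parameters
x = torch.randn(2, 4, device=device)    # step 1: move the inputs
y = model(x)                            # no cross-device mismatch
```

Without the device=x.device argument, grid would be created on the CPU and the matmul would raise exactly the "two devices, cuda:0 and cpu" error from this issue.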
mw66 commented 5 months ago

The fix is easy:

  1. Call .to(device), with a single consistent device, on the initialized network (its parameters) and on the inputs.
  2. If it still errors, check the log and make sure every dynamically created variable is also moved with .to(device).  # IMPORTANT

I think this is what Rhys has already done; see the comment above:

https://github.com/KindXiaoming/pykan/issues/52#issuecomment-2094769633

@yuedajiong If you can make it work on cuda, can you submit a PR?

yuedajiong commented 5 months ago

@mw66 Please zip all the code in your project. I have a GPU, and I think I can debug and fix it, then send it back to you. If you tried API_10_device.ipynb directly, please confirm; I will debug that program.

Update: 1) It was easy to fix, about 3 minutes; done. Please check my zip and search for all occurrences of 'John'. 2) The author is a scientist who focuses on the algorithm; I assume he typically experiments on CPU, so you may need to modify the code yourself if you want it to be of very high quality.

kan.zip

You can submit a PR and close this issue now.

mw66 commented 5 months ago

@yuedajiong Thanks for the quick fix.

But I think someone has already submitted a PR:

https://github.com/KindXiaoming/pykan/pull/83

and I have tested it, it's working.

I also tried your kan.zip: while it did train on GPU, the result is not good. With the original code (of API_10_device.ipynb), both the CPU version and the https://github.com/KindXiaoming/pykan/pull/83 CUDA version reach a final train loss around x.xxe-03, e.g.:

train loss: 6.48e-03 | test loss: 6.54e-03 | reg: 7.25e+00 : 100%|██| 50/50 [00:22<00:00,  2.20it/s]

While with your version of the fix, the final training result is around x.xxe-01, e.g.:

train loss: 5.81e-01 | test loss: 5.82e-01 | reg: 1.41e+01 : 100%|██| 50/50 [00:17<00:00,  2.81it/s]

I have run each of the above scenarios 3 times; the results are the same.