KindXiaoming / pykan

Kolmogorov Arnold Networks

model can't move to cuda #12

Closed Acetylsalicylsaeure closed 5 months ago

Acetylsalicylsaeure commented 5 months ago

Replicating tutorials/API_10_device.ipynb, I see no load on the GPU, just the CPU. VRAM gets occupied; however, checking the device of the dataset returns "cuda", while the model parameters return "cpu" as their device. This can be fixed by calling .to(device) on the model, but that breaks training, leading to the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[19], line 8
      4 dataset = create_dataset(f, n_var=4, train_num=3000, device=device)
      6 # train the model
      7 #model.train(dataset, opt="LBFGS", steps=20, lamb=1e-3, lamb_entropy=2.)
----> 8 model.train(dataset, opt="LBFGS", steps=10, lamb=5e-5, lamb_entropy=2.)

File ~/.conda/envs/pykan/lib/python3.9/site-packages/kan/KAN.py:913, in KAN.train(self, dataset, opt, steps, log, lamb, lamb_l1, lamb_entropy, lamb_coef, lamb_coefdiff, update_grid, grid_update_num, loss_fn, lr, stop_grid_update_step, batch, small_mag_threshold, small_reg_factor, metrics, sglr_avoid, save_fig, in_vars, out_vars, beta, save_fig_freq, img_folder, device)
    910 test_id = np.random.choice(dataset['test_input'].shape[0], batch_size_test, replace=False)
    912 if _ % grid_update_freq == 0 and _ < stop_grid_update_step and update_grid:
--> 913     self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
    916 if opt == "LBFGS":
    917     optimizer.step(closure)

File ~/.conda/envs/pykan/lib/python3.9/site-packages/kan/KAN.py:242, in KAN.update_grid_from_samples(self, x)
    219 '''
    220 update grid from samples
    221 
   (...)
    239 tensor([0.0128, 1.0064, 2.0000, 2.9937, 3.9873, 4.9809])
    240 '''
    241 for l in range(self.depth):
--> 242     self.forward(x)
    243     self.act_fun[l].update_grid_from_samples(self.acts[l])

File ~/.conda/envs/pykan/lib/python3.9/site-packages/kan/KAN.py:313, in KAN.forward(self, x)
    308 self.acts.append(x) # acts shape: (batch, width[l])
    311 for l in range(self.depth):
--> 313     x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
    315     if self.symbolic_enabled == True:
    316         x_symbolic, postacts_symbolic = self.symbolic_fun[l](x)

File ~/.conda/envs/pykan/lib/python3.9/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File ~/.conda/envs/pykan/lib/python3.9/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File ~/.conda/envs/pykan/lib/python3.9/site-packages/kan/KANLayer.py:175, in KANLayer.forward(self, x)
    173 preacts = x.permute(1,0).clone().reshape(batch, self.out_dim, self.in_dim)
    174 base = self.base_fun(x).permute(1,0) # shape (batch, size)
--> 175 y = coef2curve(x_eval=x, grid=self.grid[self.weight_sharing], coef=self.coef[self.weight_sharing], k=self.k) # shape (size, batch)
    176 y = y.permute(1,0) # shape (batch, size)
    177 postspline = y.clone().reshape(batch, self.out_dim, self.in_dim)

File ~/.conda/envs/pykan/lib/python3.9/site-packages/kan/spline.py:99, in coef2curve(x_eval, grid, coef, k, device)
     64 '''
     65 converting B-spline coefficients to B-spline curves. Evaluate x on B-spline curves (summing up B_batch results over B-spline basis).
     66 
   (...)
     95 torch.Size([5, 100])
     96 '''
     97 # x_eval: (size, batch), grid: (size, grid), coef: (size, coef)
     98 # coef: (size, coef), B_batch: (size, coef, batch), summer over coef
---> 99 y_eval = torch.einsum('ij,ijk->ik', coef, B_batch(x_eval, grid, k, device=device))
    100 return y_eval

File ~/.conda/envs/pykan/lib/python3.9/site-packages/torch/functional.py:380, in einsum(*args)
    375     return einsum(equation, *_operands)
    377 if len(operands) <= 2 or not opt_einsum.enabled:
    378     # the path for contracting 0 or 1 time(s) is already optimized
    379     # or the user has disabled using opt_einsum
--> 380     return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
    382 path = None
    383 if opt_einsum.is_available():

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Environment: fresh conda venv with requirements.txt installed. CUDA version: 12.2.

Any ideas which parameter could be left behind on the CPU?
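
For reference, a generic PyTorch check (assuming model is the KAN instance from the tutorial) that lists which parameters or buffers are still on the CPU:

# print every tensor registered on the model that is not on the GPU
for name, p in model.named_parameters():
    if p.device.type != "cuda":
        print("parameter still on CPU:", name)
for name, b in model.named_buffers():
    if b.device.type != "cuda":
        print("buffer still on CPU:", name)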

sumo43 commented 5 months ago

The KAN class has a "device" argument; try passing "cuda" there. It doesn't speed things up by much, though.

Acetylsalicylsaeure commented 5 months ago

KAN was called with device=device, device being "cuda", but the parameters' device was still "cpu". Furthermore, I noticed .train can be called with a device argument; setting this to "cuda" also leads to the error above.

Acetylsalicylsaeure commented 5 months ago

> The KAN class has a "device" argument; try passing "cuda" there. It doesn't speed things up by much, though.

Just saw your other issue, and I think it's just not offloading anything to the GPU. My system is severely CPU-bottlenecked, and setting the device to cuda does not lead to any speedup whatsoever. Furthermore, system monitoring shows the CPU running at 80% while the GPU sits at 0%.

Acetylsalicylsaeure commented 5 months ago

Just installed from source; now setting device="cuda" makes .train fail immediately with the initial error. The model parameters' device is still "cpu", however. Calling .cuda() fixes that, but not the error.

Acetylsalicylsaeure commented 5 months ago

Looks fixable, PR soon (?)

genglinxiao commented 5 months ago

How much performance improvement do we see on GPU over CPU? I suspect that since this is quite different from the MLP architecture, the improvements come solely from parallel computing, which depends heavily on the implementation itself.

Acetylsalicylsaeure commented 5 months ago

With bigger models it helps quite a lot:

model = KAN(width=[4,12,8,1], grid=10, k=3, seed=0, device=device)

f = lambda x: torch.exp((torch.sin(torch.pi*(x[:,[0]]**2+x[:,[1]]**2))+torch.sin(torch.pi*(x[:,[2]]**2+x[:,[3]]**2)))/2)
dataset = create_dataset(f, n_var=4, train_num=3000, device=device)

# train the model
#model.train(dataset, opt="LBFGS", steps=20, lamb=1e-3, lamb_entropy=2.);
model.train(dataset, opt="LBFGS", steps=50, lamb=5e-5, lamb_entropy=2., device=device)

Going from half an hour to 2'40 (tqdm estimate), i.e. roughly a 10x speedup, and that at only 14% GPU usage, but as mentioned my CPU is the bottleneck.

Setting width=[4,2,1], the CPU takes 49 s, the GPU 36 s.

Fixed by #7 together with setting model.to(device); closing.
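
For anyone hitting this, a minimal sketch of the setup that works for me (assuming the same API as above; the key extra step is moving the model itself with .to(device)):

import torch
from kan import KAN, create_dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# build the model on the target device, then move it explicitly;
# without .to(device) some parameters stay on the CPU
model = KAN(width=[4,12,8,1], grid=10, k=3, seed=0, device=device)
model.to(device)

f = lambda x: torch.exp((torch.sin(torch.pi*(x[:,[0]]**2+x[:,[1]]**2))
                         + torch.sin(torch.pi*(x[:,[2]]**2+x[:,[3]]**2)))/2)
dataset = create_dataset(f, n_var=4, train_num=3000, device=device)

# pass the device to .train as well, so grid updates happen on the GPU
model.train(dataset, opt="LBFGS", steps=50, lamb=5e-5, lamb_entropy=2., device=device)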

sumo43 commented 5 months ago

Yep, I ended up just moving everything to CUDA manually. Also, using Adam as the optimizer speeds things up, but it might be less stable.
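
Roughly like this (assuming .train accepts opt="Adam" as suggested by the signature in the traceback above; the lr and step count are just illustrative):

# Adam typically needs more (but cheaper) steps than LBFGS
model.train(dataset, opt="Adam", steps=200, lamb=5e-5, lamb_entropy=2., lr=1e-2, device=device)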